Explain the concept of 'mixing time' in the context of Markov chains, and how can coupling be used to shorten this time?

Mixing time refers to the time a Markov chain requires to become approximately independent of its initial state, achieving a distribution close to its stationary distribution. Coupling shortens mixing time by pairing two instances of the chain to evolve together, ensuring they converge in state. When both chains synchronize, they highlight how close the original chain is to its stationary distribution. Suitably designing such couplings can dramatically reduce mixing time, especially when the Markov chain is rapidly mixing .

In the context of learning theory, how is the concept of ϵ-nets different from ϵ-samples, and what is the impact on algorithm performance?

The concept of $\epsilon$-nets and $\epsilon$-samples relates to how we select representative subsets of data in a statistical learning setting. An $\epsilon$-net is a set that ensures every subset with measure at least $\epsilon$ contains at least one point from the net, used to cover range spaces based on a distribution, and is connected to uniform convergence properties of VC dimension . An $\epsilon$-sample, on the other hand, is used for approximating the measure of such sets to an additive error of $\epsilon$. The key impact on algorithm performance is that the data requirements for both $\epsilon$-nets and $\epsilon$-samples can dictate sample complexity bounds in statistical learning algorithms . An $\epsilon$-sample can also serve as an $\epsilon$-net, suggesting lower sample complexity requirements apply to both .\

What implications does the birthday paradox have for understanding the distribution of balls into bins, and what is a key result concerning the maximum load?

The birthday paradox highlights how even a small number of balls (people) quickly leads to a high probability of sharing bins (birthdays). In the balls-and-bins model, this implies that with m = √n balls into n bins, it is likely that some bins have more than one ball. A key result states that the maximum load is above 3 ln n/ln ln n with probability at most 1/n for large n, illustrating the uneveness when distributing many items into categories .

What is the concept of conditional expectation, and how can it be applied to the example of rolling two dice?

The concept of conditional expectation involves calculating the expected value of a random variable given some condition or event, adjusting the weight of each value by the conditional probability it assumes. In the example of rolling two dice, if we let X1 be the outcome of the first die and X be the sum of both dice, the conditional expectation E[X | X1=2] is computed by summing over all possible values of X, each weighted by the probability that X achieves that value given X1=2, resulting in an expected value of 11/2 .

How does the Chernoff bound help in deriving better probability bounds for symmetric random variables, and what specific result is obtained for the sum of independent symmetric variables?

The Chernoff bound is useful for providing stronger probability bounds for sums of independent symmetric random variables. Applying the Chernoff bound to independent variables each taking values 1 or -1 with equal probability, a derived theorem states that the probability of their sum at least 'a' is bounded by exp(-a^2/2n). This shows that even for potentially large deviations from the mean, the probability decreases exponentially .

What role does coupling play in estimating the mixing time of the coloring chain, and how can mixing time be bounded in terms of vertices and ε?

Coupling is a technique used to bound the mixing time of a Markov chain by running two copies of the chain and ensuring they coalesce into the same state. This can efficiently prove fast mixing when the divergence between copies tends to decrease more often than increase . The mixing time τ(ε) can be bounded in terms of the number of vertices by considering polynomial dependence on the size of the graph and logarithmic dependence on 1/ε . Specifically, the mixing time is polynomial in the number of vertices n and log(1/ε), assuming the chain dynamics achieve a rapidly mixing state . If each vertex in a graph has maximum degree Δ and proper colorings are considered with conditions c ≥ 4Δ + 1, a trivial coupling demonstrates that the mixing time τ(ε) satisfies τ(ε) ≤ O(n(c−4Δ)ln(n)/(ε)). This dependence illustrates that with an increasing number of vertices, longer mixing times are necessary to achieve approximate uniform distributions, but these bounds can be computed efficiently.

Discuss how the EM algorithm can be applied to mixture models and describe its steps in estimating parameters for mixed Gaussian distributions.

The EM (Expectation-Maximization) algorithm finds maximum likelihood estimates for model parameters involving latent variables, such as mixture models. For bivariate normal distributions, EM alternates between expectation (E-step) and maximization (M-step). In the E-step, the expected value of latent variables is computed based on current parameter estimates; in the M-step, parameters are updated to maximize the likelihood of observed data given these expected values. Iterating these steps refines the estimates for the distribution parameters of the Gaussian components .

How does the VC Dimension practical importance in determining the sample complexity for learning algorithms, and what example could convey a bound for samples needed?

The VC Dimension measures the capacity of a hypothesis space to fit data and is crucial in deriving the sample complexity necessary for effectively learning a concept. For instance, to ensure with high probability (at least 1 - δ) that a learning algorithm produces a hypothesis with error at most ε, the sample size must be proportional to the VC Dimension, typically given by O((d/ε) ln (1/ε) + (1/ε) ln (1/δ)), where d is the VC Dimension. For closed intervals on ℝ, a VC Dimension of 2 suggests a ϵ-sample can be efficiently learned with the appropriate amount of data .

How can the method of conditional expectations be used to transform a randomized algorithm into a deterministic one, and provide an example from SAT problems?

The method of conditional expectations helps transform a randomized algorithm into a deterministic one by ensuring that each decision made in the algorithm keeps the expected success probability as high as possible. For instance, in SAT problems with m clauses having k literals each, a Las Vegas algorithm could satisfy at least m(1-2^-k) clauses. Using conditional expectations systematically directs decisions so that the expected number of satisfied clauses remains high, effectively converting the problem into a deterministic algorithm .

Describe how the Chernoff bound is used in derandomizing algorithms and provide an example of its application.

The Chernoff bound is utilized in derandomizing algorithms to show that the outcomes are concentrated around the expected value, thereby lowering the probability of significant deviations. This assures us that deterministic techniques can closely emulate randomized algorithms. For example, in partitioning graphs into 2-cuts, ensuring that expected outcomes (e.g., edges cut) meet certain criteria helps convert a probability-based approach into a deterministic one, guaranteeing L2-norm closeness between random and guaranteed results .

Open navigation menu

Upload

100% found this document useful (2 votes)

2K views490 pages

Probability and Computing 2nd Edition

Uploaded by

Camilo Mansilla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (2 votes)

2K views490 pages

Probability and Computing 2nd Edition

Uploaded by

Camilo Mansilla

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 490

Probability and Computing

Randomization and probabilistic techniques play an important role in modern computer

science, with applications ranging from combinatorial optimization and machine learning
to communication networks and secure protocols.
This textbook provides an indispensable teaching tool to accompany a one- or two-
semester course for advanced undergraduate or beginning graduate students in computer
science and applied mathematics. It offers a comprehensive introduction to the role of ran-
domization and probabilistic techniques in modern computer science, in particular to tech-
niques and paradigms used in the development and probabilistic analysis of algorithms
and for data analyses. It assumes only an elementary background in discrete mathematics
and gives a rigorous yet accessible treatment of the material, with numerous examples and
applications.
The first half of the book covers core material, including random sampling, expecta-
tions, Markov’s inequality, Chebyshev’s inequality, Chernoff bounds, balls-and-bins mod-
els, the probabilistic method, and Markov chains. In the second half, the authors delve
into more advanced topics such as continuous probability, applications of limited indepen-
dence, entropy, Markov chain Monte Carlo methods, coupling, martingales, and balanced
allocations.
This greatly expanded new edition includes several newly added chapters and sec-
tions, covering topics including normal distributions, sample complexity, VC dimension,
Rademacher complexity, power laws and related distributions, cuckoo hashing, and appli-
cations of the Lovász Local Lemma. New material relevant to machine learning and big
data analysis enables students to learn up-to-date techniques and applications. Among the
many new exercises and examples are programming-related exercises that provide students
with practical experience and training related to the theoretical concepts covered in the text.

Michael Mitzenmacher is a Professor of Computer Science in the School of Engineering

and Applied Sciences at Harvard University, where he was also the Area Dean for Com-
puter Science from 2010 to 2013. Michael has authored or co-authored over 200 confer-
ence and journal publications on a variety of topics, including algorithms for the Internet,
efficient hash-based data structures, erasure and error-correcting codes, power laws, and
compression. His work on low-density parity-check codes shared the 2002 IEEE Informa-
tion Theory Society Best Paper Award and won the 2009 ACM SIGCOMM Test of Time
Award. He is an ACM Fellow, and was elected as the Chair of the ACM Special Interest
Group on Algorithms and Computation Theory in 2015.
Eli Upfal is a Professor of Computer Science at Brown University, where he was also the
department chair from 2002 to 2007. Prior to joining Brown in 1998, he was a researcher and
project manager at the IBM Almaden Research Center, and a Professor of Applied Math-
ematics and Computer Science at the Weizmann Institute of Science. His main research
interests are randomized algorithms, probabilistic analysis of algorithms, and computa-
tional statistics, with applications ranging from combinatorial and stochastic optimization,
computational biology, and computational finance. He is a Fellow of both the IEEE and the
ACM.
Probability and Computing
Randomization and Probabilistic
Techniques in Algorithms and
Data Analysis

Second Edition

Michael Mitzenmacher Eli Upfal

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi - 110002, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge.

It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107154889
10.1017/9781316651124
© Michael Mitzenmacher and Eli Upfal 2017
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2017
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging in Publication Data
Names: Mitzenmacher, Michael, 1969– author. | Upfal, Eli, 1954– author.
Title: Probability and computing / Michael Mitzenmacher Eli Upfal.
Description: Second edition. | Cambridge, United Kingdom ;
New York, NY, USA : Cambridge University Press, [2017] |
Includes bibliographical references and index.
Identifiers: LCCN 2016041654 | ISBN 9781107154889
Subjects: LCSH: Algorithms. | Probabilities. | Stochastic analysis.
Classification: LCC QA274.M574 2017 | DDC 518/.1 – dc23
LC record available at https://lccn.loc.gov/2016041654
ISBN 978-1-107-15488-9 Hardback
Additional resources for this publication at www.cambridge.org/Mitzenmacher.
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party Internet Web sites referred to in this publication
and does not guarantee that any content on such Web sites is, or will remain,
accurate or appropriate.
To

Stephanie, Michaela, Jacqueline, and Chloe

M.M.

Liane, Tamara, and Ilan

E.U.
Contents

Preface to the Second Edition page xv

Preface to the First Edition xvii

1 Events and Probability 1

1.1 Application: Verifying Polynomial Identities 1
1.2 Axioms of Probability 3
1.3 Application: Verifying Matrix Multiplication 8
1.4 Application: Naïve Bayesian Classifier 12
1.5 Application: A Randomized Min-Cut Algorithm 15
1.6 Exercises 17

2 Discrete Random Variables and Expectation 23

2.1 Random Variables and Expectation 23
2.1.1 Linearity of Expectations 25
2.1.2 Jensen’s Inequality 26
2.2 The Bernoulli and Binomial Random Variables 27
2.3 Conditional Expectation 29
2.4 The Geometric Distribution 33
2.4.1 Example: Coupon Collector’s Problem 35
2.5 Application: The Expected Run-Time of Quicksort 37
2.6 Exercises 40

3 Moments and Deviations 47

3.1 Markov’s Inequality 47
3.2 Variance and Moments of a Random Variable 48
3.2.1 Example: Variance of a Binomial Random Variable 51

vii
contents

3.3 Chebyshev’s Inequality 51

3.3.1 Example: Coupon Collector’s Problem 53
3.4 Median and Mean 55
3.5 Application: A Randomized Algorithm for Computing the Median 57
3.5.1 The Algorithm 58
3.5.2 Analysis of the Algorithm 59
3.6 Exercises 62

4 Chernoff and Hoeffding Bounds 66

4.1 Moment Generating Functions 66
4.2 Deriving and Applying Chernoff Bounds 68
4.2.1 Chernoff Bounds for the Sum of Poisson Trials 68
4.2.2 Example: Coin Flips 72
4.2.3 Application: Estimating a Parameter 72
4.3 Better Bounds for Some Special Cases 73
4.4 Application: Set Balancing 76
4.5 The Hoeffding Bound 77
4.6∗ Application: Packet Routing in Sparse Networks 79
4.6.1 Permutation Routing on the Hypercube 80
4.6.2 Permutation Routing on the Butterfly 85
4.7 Exercises 90

5 Balls, Bins, and Random Graphs 97

5.1 Example: The Birthday Paradox 97
5.2 Balls into Bins 99
5.2.1 The Balls-and-Bins Model 99
5.2.2 Application: Bucket Sort 101
5.3 The Poisson Distribution 101
5.3.1 Limit of the Binomial Distribution 105
5.4 The Poisson Approximation 107
5.4.1∗ Example: Coupon Collector’s Problem, Revisited 111
5.5 Application: Hashing 113
5.5.1 Chain Hashing 113
5.5.2 Hashing: Bit Strings 114
5.5.3 Bloom Filters 116
5.5.4 Breaking Symmetry 118
5.6 Random Graphs 119
5.6.1 Random Graph Models 119
5.6.2 Application: Hamiltonian Cycles in Random Graphs 121
5.7 Exercises 127
5.8 An Exploratory Assignment 133

6 The Probabilistic Method 135

6.1 The Basic Counting Argument 135

viii
contents

6.2 The Expectation Argument 137

6.2.1 Application: Finding a Large Cut 138
6.2.2 Application: Maximum Satisfiability 139
6.3 Derandomization Using Conditional Expectations 140
6.4 Sample and Modify 142
6.4.1 Application: Independent Sets 142
6.4.2 Application: Graphs with Large Girth 143
6.5 The Second Moment Method 143
6.5.1 Application: Threshold Behavior in Random Graphs 144
6.6 The Conditional Expectation Inequality 145
6.7 The Lovász Local Lemma 147
6.7.1 Application: Edge-Disjoint Paths 150
6.7.2 Application: Satisfiability 151
6.8∗ Explicit Constructions Using the Local Lemma 152
6.8.1 Application: A Satisfiability Algorithm 152
6.9 Lovász Local Lemma: The General Case 155
6.10∗ The Algorithmic Lovász Local Lemma 158
6.11 Exercises 162

7 Markov Chains and Random Walks 168

7.1 Markov Chains: Definitions and Representations 168
7.1.1 Application: A Randomized Algorithm for 2-Satisfiability 171
7.1.2 Application: A Randomized Algorithm for 3-Satisfiability 174
7.2 Classification of States 178
7.2.1 Example: The Gambler’s Ruin 181
7.3 Stationary Distributions 182
7.3.1 Example: A Simple Queue 188
7.4 Random Walks on Undirected Graphs 189
7.4.1 Application: An s–t Connectivity Algorithm 192
7.5 Parrondo’s Paradox 193
7.6 Exercises 198

8 Continuous Distributions and the Poisson Process 205

8.1 Continuous Random Variables 205
8.1.1 Probability Distributions in R 205
8.1.2 Joint Distributions and Conditional Probability 208
8.2 The Uniform Distribution 210
8.2.1 Additional Properties of the Uniform Distribution 211
8.3 The Exponential Distribution 213
8.3.1 Additional Properties of the Exponential Distribution 214
8.3.2∗ Example: Balls and Bins with Feedback 216
8.4 The Poisson Process 218
8.4.1 Interarrival Distribution 221
ix
contents

8.4.2 Combining and Splitting Poisson Processes 222

8.4.3 Conditional Arrival Time Distribution 224
8.5 Continuous Time Markov Processes 226
8.6 Example: Markovian Queues 229
8.6.1 M/M/1 Queue in Equilibrium 230
8.6.2 M/M/1/K Queue in Equilibrium 233
8.6.3 The Number of Customers in an M/M/∞ Queue 233
8.7 Exercises 236

9 The Normal Distribution 242

9.1 The Normal Distribution 242
9.1.1 The Standard Normal Distribution 242
9.1.2 The General Univariate Normal Distribution 243
9.1.3 The Moment Generating Function 246
9.2∗ Limit of the Binomial Distribution 247
9.3 The Central Limit Theorem 249
9.4∗ Multivariate Normal Distributions 252
9.4.1 Properties of the Multivariate Normal Distribution 255
9.5 Application: Generating Normally Distributed Random Values 256
9.6 Maximum Likelihood Point Estimates 258
9.7 Application: EM Algorithm For a Mixture of Gaussians 261
9.8 Exercises 265

10 Entropy, Randomness, and Information 269

10.1 The Entropy Function 269
10.2 Entropy and Binomial Coefficients 272
10.3 Entropy: A Measure of Randomness 274
10.4 Compression 278
10.5∗ Coding: Shannon’s Theorem 281
10.6 Exercises 290

11 The Monte Carlo Method 297

11.1 The Monte Carlo Method 297
11.2 Application: The DNF Counting Problem 300
11.2.1 The Naïve Approach 300
11.2.2 A Fully Polynomial Randomized Scheme for DNF Counting 302
11.3 From Approximate Sampling to Approximate Counting 304
11.4 The Markov Chain Monte Carlo Method 308
11.4.1 The Metropolis Algorithm 310
11.5 Exercises 312
11.6 An Exploratory Assignment on Minimum Spanning Trees 315

x
contents

12 Coupling of Markov Chains 317

12.1 Variation Distance and Mixing Time 317
12.2 Coupling 320
12.2.1 Example: Shuffling Cards 321
12.2.2 Example: Random Walks on the Hypercube 322
12.2.3 Example: Independent Sets of Fixed Size 323
12.3 Application: Variation Distance Is Nonincreasing 324
12.4 Geometric Convergence 327
12.5 Application: Approximately Sampling Proper
Colorings 328
12.6 Path Coupling 332
12.7 Exercises 336

13 Martingales 341
13.1 Martingales 341
13.2 Stopping Times 343
13.2.1 Example: A Ballot Theorem 345
13.3 Wald’s Equation 346
13.4 Tail Inequalities for Martingales 349
13.5 Applications of the Azuma–Hoeffding Inequality 351
13.5.1 General Formalization 351
13.5.2 Application: Pattern Matching 353
13.5.3 Application: Balls and Bins 354
13.5.4 Application: Chromatic Number 355
13.6 Exercises 355

14 Sample Complexity, VC Dimension, and Rademacher

Complexity 361
14.1 The Learning Setting 362
14.2 VC Dimension 363
14.2.1 Additional Examples of VC Dimension 365
14.2.2 Growth Function 366
14.2.3 VC dimension component bounds 368
14.2.4 -nets and -samples 369
14.3 The -net Theorem 370
14.4 Application: PAC Learning 374
14.5 The -sample Theorem 377
14.5.1 Application: Agnostic Learning 379
14.5.2 Application: Data Mining 380
14.6 Rademacher Complexity 382
14.6.1 Rademacher Complexity and Sample Error 385

xi
contents

14.6.2 Estimating the Rademacher Complexity 387

14.6.3 Application: Agnostic Learning of a Binary Classification 388
14.7 Exercises 389

15 Pairwise Independence and Universal Hash Functions 392

15.1 Pairwise Independence 392
15.1.1 Example: A Construction of Pairwise Independent Bits 393
15.1.2 Application: Derandomizing an Algorithm for Large Cuts 394
15.1.3 Example: Constructing Pairwise Independent Values Modulo
a Prime 395
15.2 Chebyshev’s Inequality for Pairwise Independent Variables 396
15.2.1 Application: Sampling Using Fewer Random Bits 397
15.3 Universal Families of Hash Functions 399
15.3.1 Example: A 2-Universal Family of Hash Functions 401
15.3.2 Example: A Strongly 2-Universal Family of Hash Functions 402
15.3.3 Application: Perfect Hashing 404
15.4 Application: Finding Heavy Hitters in Data Streams 407
15.5 Exercises 411

16 Power Laws and Related Distributions 415

16.1 Power Law Distributions: Basic Definitions and Properties 416
16.2 Power Laws in Language 418
16.2.1 Zipf’s Law and Other Examples 418
16.2.2 Languages via Optimization 419
16.2.3 Monkeys Typing Randomly 419
16.3 Preferential Attachment 420
16.3.1 A Formal Version 422
16.4 Using the Power Law in Algorithm Analysis 425
16.5 Other Related Distributions 427
16.5.1 Lognormal Distributions 427
16.5.2 Power Law with Exponential Cutoff 428
16.6 Exercises 429

17 Balanced Allocations and Cuckoo Hashing 433

17.1 The Power of Two Choices 433
17.1.1 The Upper Bound 433
17.2 Two Choices: The Lower Bound 438
17.3 Applications of the Power of Two Choices 441
17.3.1 Hashing 441
17.3.2 Dynamic Resource Allocation 442
17.4 Cuckoo Hashing 442
17.5 Extending Cuckoo Hashing 452
17.5.1 Cuckoo Hashing with Deletions 452

xii
contents

17.5.2 Handling Failures 453

17.5.3 More Choices and Bigger Bins 454
17.6 Exercises 456

Note: Asterisks indicate advanced material for this chapter.

xiii
Preface to the Second Edition

In the ten years since the publication of the first edition of this book, probabilistic
methods have become even more central to computer science, rising with the growing
importance of massive data analysis, machine learning, and data mining. Many of the
successful applications of these areas rely on algorithms and heuristics that build on
sophisticated probabilistic and statistical insights. Judicious use of these tools requires
a thorough understanding of the underlying mathematical concepts. Most of the new
material in this second edition focuses on these concepts.
The ability in recent years to create, collect, and store massive data sets, such as
the World Wide Web, social networks, and genome data, lead to new challenges in
modeling and analyzing such structures. A good foundation for models and analysis
comes from understanding some standard distributions. Our new chapter on the nor-
mal distribution (also known as the Gaussian distribution) covers the most common
statistical distribution, as usual with an emphasis on how it is used in settings in com-
puter science, such as for tail bounds. However, an interesting phenomenon is that in
many modern data sets, including social networks and the World Wide Web, we do not
see normal distributions, but instead we see distributions with very different proper-
ties, most notably unusually heavy tails. For example, some pages in the World Wide
Web have an unusually large number of pages that link to them, orders of magnitude
larger than the average. The new chapter on power laws and related distributions covers
specific distributions that are important for modeling and understanding these kinds of
modern data sets.
Machine learning is one of the great successes of computer science in recent years,
providing efficient tools for modeling, understanding, and making predictions based on
large data sets. A question that is often overlooked in practical applications of machine
learning is the accuracy of the predictions, and in particular the relation between accu-
racy and the sample size. A rigorous introduction to approaches to these important
questions is presented in a new chapter on sample complexity, VC dimension, and
Rademacher averages.

xv
preface to the second edition

We have also used the new edition to enhance some of our previous material. For
example, we present some of the recent advances on algorithmic variations of the pow-
erful Lovász local lemma, and we have a new section covering the wonderfully named
and increasingly useful hashing approach known as cuckoo hashing. Finally, in addi-
tion to all of this new material, the new edition includes updates and corrections, and
many new exercises.
We thank the many readers who sent us corrections over the years – unfortunately,
too many to list here!

xvi
Preface to the First Edition

Why Randomness?

Why should computer scientists study and use randomness? Computers appear to
behave far too unpredictably as it is! Adding randomness would seemingly be a dis-
advantage, adding further complications to the already challenging task of efficiently
utilizing computers.
Science has learned in the last century to accept randomness as an essential com-
ponent in modeling and analyzing nature. In physics, for example, Newton’s laws led
people to believe that the universe was a deterministic place; given a big enough calcu-
lator and the appropriate initial conditions, one could determine the location of planets
years from now. The development of quantum theory suggests a rather different view;
the universe still behaves according to laws, but the backbone of these laws is proba-
bilistic. “God does not play dice with the universe” was Einstein’s anecdotal objection
to modern quantum mechanics. Nevertheless, the prevailing theory today for subparti-
cle physics is based on random behavior and statistical laws, and randomness plays a
significant role in almost every other field of science ranging from genetics and evolu-
tion in biology to modeling price fluctuations in a free-market economy.
Computer science is no exception. From the highly theoretical notion of probabilis-
tic theorem proving to the very practical design of PC Ethernet cards, randomness
and probabilistic methods play a key role in modern computer science. The last two
decades have witnessed a tremendous growth in the use of probability theory in comput-
ing. Increasingly more advanced and sophisticated probabilistic techniques have been
developed for use within broader and more challenging computer science applications.
In this book, we study the fundamental ways in which randomness comes to bear on
computer science: randomized algorithms and the probabilistic analysis of algorithms.
Randomized algorithms: Randomized algorithms are algorithms that make random
choices during their execution. In practice, a randomized program would use values
generated by a random number generator to decide the next step at several branches
of its execution. For example, the protocol implemented in an Ethernet card uses ran-
dom numbers to decide when it next tries to access the shared Ethernet communication
xvii
preface to the first edition

medium. The randomness is useful for breaking symmetry, preventing different cards
from repeatedly accessing the medium at the same time. Other commonly used applica-
tions of randomized algorithms include Monte Carlo simulations and primality testing
in cryptography. In these and many other important applications, randomized algo-
rithms are significantly more efficient than the best known deterministic solutions.
Furthermore, in most cases the randomized algorithms are also simpler and easier to
program.
These gains come at a price; the answer may have some probability of being incor-
rect, or the efficiency is guaranteed only with some probability. Although it may seem
unusual to design an algorithm that may be incorrect, if the probability of error is suf-
ficiently small then the improvement in speed or memory requirements may well be
worthwhile.
Probabilistic analysis of algorithms: Complexity theory tries to classify computa-
tion problems according to their computational complexity, in particular distinguishing
between easy and hard problems. For example, complexity theory shows that the Trav-
eling Salesman problem is NP-hard. It is therefore very unlikely that we will ever know
an algorithm that can solve any instance of the Traveling Salesman problem in time that
is subexponential in the number of cities. An embarrassing phenomenon for the clas-
sical worst-case complexity theory is that the problems it classifies as hard to compute
are often easy to solve in practice. Probabilistic analysis gives a theoretical explanation
for this phenomenon. Although these problems may be hard to solve on some set of
pathological inputs, on most inputs (in particular, those that occur in real-life applica-
tions) the problem is actually easy to solve. More precisely, if we think of the input as
being randomly selected according to some probability distribution on the collection of
all possible inputs, we are very likely to obtain a problem instance that is easy to solve,
and instances that are hard to solve appear with relatively small probability. Probabilis-
tic analysis of algorithms is the method of studying how algorithms perform when the
input is taken from a well-defined probabilistic space. As we will see, even NP-hard
problems might have algorithms that are extremely efficient on almost all inputs.

The Book

This textbook is designed to accompany one- or two-semester courses for advanced

undergraduate or beginning graduate students in computer science and applied math-
ematics. The study of randomized and probabilistic techniques in most leading uni-
versities has moved from being the subject of an advanced graduate seminar meant
for theoreticians to being a regular course geared generally to advanced undergraduate
and beginning graduate students. There are a number of excellent advanced, research-
oriented books on this subject, but there is a clear need for an introductory textbook.
We hope that our book satisfies this need.
The textbook has developed from courses on probabilistic methods in computer sci-
ence taught at Brown (CS 155) and Harvard (CS 223) in recent years. The emphasis in
these courses and in this textbook is on the probabilistic techniques and paradigms, not
on particular applications. Each chapter of the book is devoted to one such method or
xviii
preface to the first edition

technique. Techniques are clarified though examples based on analyzing randomized

algorithms or developing probabilistic analysis of algorithms on random inputs. Many
of these examples are derived from problems in networking, reflecting a prominent
trend in the networking field (and the taste of the authors).
The book contains fourteen chapters. We may view the book as being divided into
two parts, where the first part (Chapters 1–7) comprises what we believe is core mate-
rial. The book assumes only a basic familiarity with probability theory, equivalent to
what is covered in a standard course on discrete mathematics for computer scientists.
Chapters 1–3 review this elementary probability theory while introducing some inter-
esting applications. Topics covered include random sampling, expectation, Markov’s
inequality, variance, and Chebyshev’s inequality. If the class has sufficient background
in probability, then these chapters can be taught quickly. We do not suggest skipping
them, however, because they introduce the concepts of randomized algorithms and
probabilistic analysis of algorithms and also contain several examples that are used
throughout the text.
Chapters 4–7 cover more advanced topics, including Chernoff bounds, balls-and-
bins models, the probabilistic method, and Markov chains. The material in these chap-
ters is more challenging than in the initial chapters. Sections that are particularly chal-
lenging (and hence that the instructor may want to consider skipping) are marked with
an asterisk. The core material in the first seven chapters may constitute the bulk of a
quarter- or semester-long course, depending on the pace.
The second part of the book (Chapters 8–17) covers additional advanced material
that can be used either to fill out the basic course as necessary or for a more advanced
second course. These chapters are largely self-contained, so the instructor can choose
the topics best suited to the class. The chapters on continuous probability and entropy
are perhaps the most appropriate for incorporating into the basic course. Our intro-
duction to continuous probability (Chapter 8) focuses on uniform and exponential
distributions, including examples from queueing theory. Our examination of entropy
(Chapter 10) shows how randomness can be measured and how entropy arises naturally
in the context of randomness extraction, compression, and coding.
Chapters 11 and 12 cover the Monte Carlo method and coupling, respectively; these
chapters are closely related and are best taught together. Chapter 13, on martingales,
covers important issues on dealing with dependent random variables, a theme that con-
tinues in a different vein in Chapter 15 is the development of pairwise independence
and derandomization. Finally, the chapter on balanced allocations (Chapter 17) covers
a topic close to the authors’ hearts and ties in nicely with Chapter 5 concerning analysis
of balls-and-bins problems.
The order of the subjects, especially in the first part of the book, corresponds to
their relative importance in the algorithmic literature. Thus, for example, the study
of Chernoff bounds precedes more fundamental probability concepts such as Markov
chains. However, instructors may choose to teach the chapters in a different order. A
course with more emphasis on general stochastic processes, for example, may teach
Markov chains (Chapter 7) immediately after Chapters 1–3, following with the chapter
on balls, bins, and random graphs (Chapter 5, omitting the Hamiltonian cycle exam-
ple). Chapter 6 on the probabilistic method could then be skipped, following instead
xix
preface to the first edition

with continuous probability and the Poisson process (Chapter 8). The material from
Chapter 4 on Chernoff bounds, however, is needed for most of the remaining material.
Most of the exercises in the book are theoretical, but we have included some pro-
gramming exercises – including two more extensive exploratory assignments that
require some programming. We have found that occasional programming exercises are
often helpful in reinforcing the book’s ideas and in adding some variety to the course.
We have decided to restrict the material in this book to methods and techniques based
on rigorous mathematical analysis; with few exceptions, all claims in this book are fol-
lowed by full proofs. Obviously, many extremely useful probabilistic methods do not
fall within this strict category. For example, in the important area of Monte Carlo meth-
ods, most practical solutions are heuristics that have been demonstrated to be effective
and efficient by experimental evaluation rather than by rigorous mathematical analy-
sis. We have taken the view that, in order to best apply and understand the strengths
and weaknesses of heuristic methods, a firm grasp of underlying probability theory and
rigorous techniques – as we present in this book – is necessary. We hope that students
will appreciate this point of view by the end of the course.

Acknowledgments

Our first thanks go to the many probabilists and computer scientists who developed
the beautiful material covered in this book. We chose not to overload the textbook
with numerous references to the original papers. Instead, we provide a reference list
that includes a number of excellent books giving background material as well as more
advanced discussion of the topics covered here.
The book owes a great deal to the comments and feedback of students and teaching
assistants who took the courses CS 155 at Brown and CS 223 at Harvard. In particular
we wish to thank Aris Anagnostopoulos, Eden Hochbaum, Rob Hunter, and Adam
Kirsch, all of whom read and commented on early drafts of the book.
Special thanks to Dick Karp, who used a draft of the book in teaching CS 174 at
Berkeley during fall 2003. His early comments and corrections were most valuable in
improving the manuscript. Peter Bartlett taught CS 174 at Berkeley in spring 2004, also
providing many corrections and useful comments.
We thank our colleagues who carefully read parts of the manuscript, pointed out
many errors, and suggested important improvements in content and presentation: Artur
Czumaj, Alan Frieze, Claire Kenyon, Joe Marks, Salil Vadhan, Eric Vigoda, and the
anonymous reviewers who read the manuscript for the publisher.
We also thank Rajeev Motwani and Prabhakar Raghavan for allowing us to use some
of the exercises in their excellent book Randomized Algorithms.
We are grateful to Lauren Cowles of Cambridge University Press for her editorial
help and advice in preparing and organizing the manuscript.
Writing of this book was supported in part by NSF ITR Grant no. CCR-0121154.

xx
chapter one
Events and Probability

This chapter introduces the notion of randomized algorithms and reviews some basic
concepts of probability theory in the context of analyzing the performance of simple
randomized algorithms for verifying algebraic identities and finding a minimum cut-set
in a graph.

1.1. Application: Verifying Polynomial Identities

Computers can sometimes make mistakes, due for example to incorrect programming
or hardware failure. It would be useful to have simple ways to double-check the results
of computations. For some problems, we can use randomness to efficiently verify the
correctness of an output.
Suppose we have a program that multiplies together monomials. Consider the prob-
lem of verifying the following identity, which might be output by our program:
?
(x + 1)(x − 2)(x + 3)(x − 4)(x + 5)(x − 6) ≡ x6 − 7x3 + 25.
There is an easy way to verify whether the identity is correct: multiply together the
terms on the left-hand side and see if the resulting polynomial matches the right-hand
side. In this example, when we multiply all the constant terms on the left, the result
does not match the constant term on the right, so the identity cannot be valid. More
generally, given two polynomials F (x) and G(x), we can verify the identity
?
F (x) ≡ G(x)
d i

by converting the two polynomials to their canonical forms i=0 ci x ; two polynomi-
als are equivalent if and only if all the coefficients in their canonical forms are equal.
From thisdpoint on let us assume that, as in our example, F (x) is given as a product
F (x) = i=1 (x − ai ) and G(x) is given in its canonical form. Transforming F (x) to
its canonical form by consecutively multiplying the ith monomial with the product of

1
events and probability

the first i − 1 monomials requires (d 2 ) multiplications of coefficients. We assume in

what follows that each multiplication can be performed in constant time, although if
the products of the coefficients grow large then it could conceivably require more than
constant time to add and multiply numbers together.
So far, we have not said anything particularly interesting. To check whether the
computer program has multiplied monomials together correctly, we have suggested
multiplying the monomials together again to check the result. Our approach for check-
ing the program is to write another program that does essentially the same thing we
expect the first program to do. This is certainly one way to double-check a program:
write a second program that does the same thing, and make sure they agree. There
are at least two problems with this approach, both stemming from the idea that there
should be a difference between checking a given answer and recomputing it. First, if
there is a bug in the program that multiplies monomials, the same bug may occur in
the checking program. (Suppose that the checking program was written by the same
person who wrote the original program!) Second, it stands to reason that we would like
to check the answer in less time than it takes to try to solve the original problem all over
again.
Let us instead utilize randomness to obtain a faster method to verify the identity. We
informally explain the algorithm and then set up the formal mathematical framework
for analyzing the algorithm.
Assume that the maximum degree, or the largest exponent of x, in F (x) and G(x) is
d. The algorithm chooses an integer r uniformly at random in the range {1, . . . , 100d},
where by “uniformly at random” we mean that all integers are equally likely to be
chosen. The algorithm then computes the values F (r) and G(r). If F (r) = G(r) the
algorithm decides that the two polynomials are not equivalent, and if F (r) = G(r) the
algorithm decides that the two polynomials are equivalent.
Suppose that in one computation step the algorithm can generate an integer chosen
uniformly at random in the range {1, . . . , 100d}. Computing the values of F (r) and
G(r) can be done in O(d) time, which is faster than computing the canonical form of
F (r). The randomized algorithm, however, may give a wrong answer.
How can the algorithm give the wrong answer?
If F (x) ≡ G(x), then the algorithm gives the correct answer, since it will find that
F (r) = G(r) for any value of r.
If F (x) ≡ G(x) and F (r) = G(r), then the algorithm gives the correct answer since
it has found a case where F (x) and G(x) disagree. Thus, when the algorithm decides
that the two polynomials are not the same, the answer is always correct.
If F (x) ≡ G(x) and F (r) = G(r), the algorithm gives the wrong answer. In other
words, it is possible that the algorithm decides that the two polynomials are the
same when they are not. For this error to occur, r must be a root of the equation
F (x) − G(x) = 0. The degree of the polynomial F (x) − G(x) is no larger than d and,
by the fundamental theorem of algebra, a polynomial of degree up to d has no more
than d roots. Thus, if F (x) ≡ G(x), then there are no more than d values in the
range {1, . . . , 100d} for which F (r) = G(r). Since there are 100d values in the range
{1, . . . , 100d}, the chance that the algorithm chooses such a value and returns a wrong
answer is no more than 1/100.
2
1.2 axioms of probability

1.2. Axioms of Probability

We turn now to a formal mathematical setting for analyzing the randomized algorithm.
Any probabilistic statement must refer to the underlying probability space.
Definition 1.1: A probability space has three components:
1. a sample space , which is the set of all possible outcomes of the random process
modeled by the probability space;
2. a family of sets F representing the allowable events, where each set in F is a subset1
of the sample space ; and
3. a probability function Pr : F → R satisfying Definition 1.2.
An element of is called a simple or elementary event.
In the randomized algorithm for verifying polynomial identities, the sample space
is the set of integers {1, . . . , 100d}. Each choice of an integer r in this range is a simple
event.
Definition 1.2: A probability function is any function Pr : F → R that satisfies the
following conditions:
1. for any event E, 0 ≤ Pr(E ) ≤ 1;
2. Pr() = 1; and
3. for any finite or countably infinite sequence of pairwise mutually disjoint events
E1 , E2 , E3 , . . . ,

Pr Ei = Pr(Ei ).
i≥1 i≥1

In most of this book we will use discrete probability spaces. In a discrete probability
space the sample space is finite or countably infinite, and the family F of allow-
able events consists of all subsets of . In a discrete probability space, the probability
function is uniquely defined by the probabilities of the simple events.
Again, in the randomized algorithm for verifying polynomial identities, each choice
of an integer r is a simple event. Since the algorithm chooses the integer uniformly at
random, all simple events have equal probability. The sample space has 100d simple
events, and the sum of the probabilities of all simple events must be 1. Therefore each
simple event has probability 1/100d.
Because events are sets, we use standard set theory notation to express combinations
of events. We write E1 ∩ E2 for the occurrence of both E1 and E2 and write E1 ∪ E2 for
the occurrence of either E1 or E2 (or both). For example, suppose we roll two dice. If
E1 is the event that the first die is a 1 and E2 is the event that the second die is a 1, then
E1 ∩ E2 denotes the event that both dice are 1 while E1 ∪ E2 denotes the event that at
least one of the two dice lands on 1. Similarly, we write E1 − E2 for the occurrence

1 In a discrete probability space F = 2 . Otherwise, and introductory readers may skip this point, since the events
need to be measurable, F must include the empty set and be closed under complement and union and intersection
of countably many sets (a σ -algebra).

3
events and probability

of an event that is in E1 but not in E2 . With the same dice example, E1 − E2 consists
of the event where the first die is a 1 and the second die is not. We use the notation Ē
as shorthand for − E; for example, if E is the event that we obtain an even number
when rolling a die, then Ē is the event that we obtain an odd number.
Definition 1.2 yields the following obvious lemma.

Lemma 1.1: For any two events E1 and E2 ,

Pr(E1 ∪ E2 ) = Pr(E1 ) + Pr(E2 ) − Pr(E1 ∩ E2 ).

Proof: From the definition,

Pr(E1 ) = Pr(E1 − (E1 ∩ E2 )) + Pr(E1 ∩ E2 ),

Pr(E2 ) = Pr(E2 − (E1 ∩ E2 )) + Pr(E1 ∩ E2 ),
Pr(E1 ∪ E2 ) = Pr(E1 − (E1 ∩ E2 )) + Pr(E2 − (E1 ∩ E2 )) + Pr(E1 ∩ E2 ).

The lemma easily follows.

A consequence of Definition 1.2 is known as the union bound. Although it is very

simple, it is tremendously useful.

Lemma 1.2: For any finite or countably infinite sequence of events E1 , E2 , . . . ,

Pr Ei ≤ Pr(Ei ).
i≥1 i≥1

Notice that Lemma 1.2 differs from the third part of Definition 1.2 in that Definition
1.2 is an equality and requires the events to be pairwise mutually disjoint.
Lemma 1.1 can be generalized to the following equality, often referred to as the
inclusion–exclusion principle.

Lemma 1.3: Let E1 , . . . , En be any n events. Then

n
n
Pr Ei = Pr(Ei ) − Pr(Ei ∩ E j ) + Pr(Ei ∩ E j ∩ Ek )
i=1 i=1 i< j i< j<k

+1
− · · · + (−1) Pr Eir + ··· .
i1 <i2 <···<i r=1

The proof of the inclusion–exclusion principle is left as Exercise 1.7.

We showed before that the only case in which the algorithm may fail to give the
correct answer is when the two input polynomials F (x) and G(x) are not equivalent;
the algorithm then gives an incorrect answer if the random number it chooses is a root
of the polynomial F (x) − G(x). Let E represent the event that the algorithm failed to
give the correct answer. The elements of the set corresponding to E are the roots of
the polynomial F (x) − G(x) that are in the set of integers {1, . . . , 100d}. Since the
polynomial has no more than d roots it follows that the event E includes no more than
4
1.2 axioms of probability

d simple events, and therefore

d 1
Pr(algorithm fails) = Pr(E ) ≤ = .
100d 100
It may seem unusual to have an algorithm that can return the wrong answer. It may
help to think of the correctness of an algorithm as a goal that we seek to optimize in
conjunction with other goals. In designing an algorithm, we generally seek to minimize
the number of computational steps and the memory required. Sometimes there is a
trade-off; there may be a faster algorithm that uses more memory or a slower algorithm
that uses less memory. The randomized algorithm we have presented gives a trade-off
between correctness and speed. Allowing algorithms that may give an incorrect answer
(but in a systematic way) expands the trade-off space available in designing algorithms.
Rest assured, however, that not all randomized algorithms give incorrect answers, as
we shall see.
For the algorithm just described, the algorithm gives the correct answer 99% of
the time even when the polynomials are not equivalent. Can we improve this prob-
ability? One way is to choose the random number r from a larger range of integers.
If our sample space is the set of integers {1, . . . , 1000d}, then the probability of a
wrong answer is at most 1/1000. At some point, however, the range of values we
can use is limited by the precision available on the machine on which we run the
algorithm.
Another approach is to repeat the algorithm multiple times, using different random
values to test the identity. The property we use here is that the algorithm has a one-sided
error. The algorithm may be wrong only when it outputs that the two polynomials are
equivalent. If any run yields a number r such that F (r) = G(r), then the polynomials are
not equivalent. Thus, if we repeat the algorithm a number of times and find F (r) = G(r)
in at least one round of the algorithm, we know that F (x) and G(x) are not equivalent.
The algorithm outputs that the two polynomials are equivalent only if there is equality
for all runs.
In repeating the algorithm we repeatedly choose a random number in the range
{1, . . . , 100d}. Repeatedly choosing random numbers according to a given distribution
is generally referred to as sampling. In this case, we can repeatedly choose random
numbers in the range {1, . . . , 100d} in two ways: we can sample either with replace-
ment or without replacement. Sampling with replacement means that we do not remem-
ber which numbers we have already tested; each time we run the algorithm, we choose
a number uniformly at random from the range {1, . . . , 100d} regardless of previous
choices, so there is some chance we will choose an r that we have chosen on a previous
run. Sampling without replacement means that, once we have chosen a number r, we
do not allow the number to be chosen on subsequent runs; the number chosen at a given
iteration is uniform over all previously unselected numbers.
Let us first consider the case where sampling is done with replacement. Assume
that we repeat the algorithm k times, and that the input polynomials are not equiva-
lent. What is the probability that in all k iterations our random sampling from the set
{1, . . . , 100d} yields roots of the polynomial F (x) − G(x), resulting in a wrong output
by the algorithm? If k = 1, we know that this probability is at most d/100d = 1/100.
5
events and probability

If k = 2, it seems that the probability that the first iteration finds a root is at most 1/100
and the probability that the second iteration finds a root is at most 1/100, so the prob-
ability that both iterations find a root is at most (1/100)2 . Generalizing, for any k, the
probability of choosing roots for k iterations would be at most (1/100)k .
To formalize this, we introduce the notion of independence.
Definition 1.3: Two events E and F are independent if and only if
Pr(E ∩ F ) = Pr(E ) · Pr(F ).
More generally, events E1 , E2 , . . . , Ek are mutually independent if and only if, for any
subset I ⊆ [1, k],

Pr Ei = Pr(Ei ).
i∈I i∈I

If our algorithm samples with replacement then in each iteration the algorithm chooses
a random number uniformly at random from the set {1, . . . , 100d}, and thus the choice
in one iteration is independent of the choices in previous iterations. For the case where
the polynomials are not equivalent, let Ei be the event that, on the ith run of the algo-
rithm, we choose a root ri such that F (ri ) − G(ri ) = 0. The probability that the algo-
rithm returns the wrong answer is given by
Pr(E1 ∩ E2 ∩ · · · ∩ Ek ).
Since Pr(Ei ) is at most d/100d and since the events E1 , E2 , . . . , Ek are independent,
the probability that the algorithm gives the wrong answer after k iterations is
k k
d 1 k
Pr(E1 ∩ E2 ∩ · · · ∩ Ek ) = Pr(Ei ) ≤ = .
i=1 i=1
100d 100

The probability of making an error is therefore at most exponentially small in the num-
ber of trials.
Now let us consider the case where sampling is done without replacement. In this
case the probability of choosing a given number is conditioned on the events of the
previous iterations.
Definition 1.4: The conditional probability that event E occurs given that event F
occurs is
Pr(E ∩ F )
Pr(E | F ) = .
Pr(F )
The conditional probability is well-defined only if Pr(F ) > 0.
Intuitively, we are looking for the probability of E ∩ F within the set of events defined
by F. Because F defines our restricted sample space, we normalize the probabilities
by dividing by Pr(F ), so that the sum of the probabilities of all events is 1. When
Pr(F ) > 0, the definition can also be written in the useful form
Pr(E | F ) Pr(F ) = Pr(E ∩ F ).

6
1.2 axioms of probability

Notice that, when E and F are independent and Pr(F ) = 0, we have

Pr(E ∩ F ) Pr(E ) Pr(F )
Pr(E | F ) = = = Pr(E ).
Pr(F ) Pr(F )
This is a property that conditional probability should have; intuitively, if two events are
independent, then information about one event should not affect the probability of the
second event.
Again assume that we repeat the algorithm k times and that the input polynomials are
not equivalent. What is the probability that in all the k iterations our random sampling
from the set {1, . . . , 100d} yields roots of the polynomial F (x) − G(x), resulting in a
wrong output by the algorithm?
As in the analysis with replacement, we let Ei be the event that the random num-
ber ri chosen in the ith iteration of the algorithm is a root of F (x) − G(x); again, the
probability that the algorithm returns the wrong answer is given by
Pr(E1 ∩ E2 ∩ · · · ∩ Ek ).
Applying the definition of conditional probability, we obtain
Pr(E1 ∩ E2 ∩ · · · ∩ Ek ) = Pr(Ek | E1 ∩ E2 ∩ · · · ∩ Ek−1 ) · Pr(E1 ∩ E2 ∩ · · · ∩ Ek−1 ),
and repeating this argument gives
Pr(E1 ∩ E2 ∩ · · · ∩ Ek )
= Pr(E1 ) · Pr(E2 | E1 ) · Pr(E3 | E1 ∩ E2 ) · · · Pr(Ek | E1 ∩ E2 ∩ · · · ∩ Ek−1 ).
Can we bound Pr(E j | E1 ∩ E2 ∩ · · · ∩ E j−1 )? Recall that there are at most d values
r for which F (r) − G(r) = 0; if trials 1 through j − 1 < d have found j − 1 of them,
then when sampling without replacement there are only d − ( j − 1) values out of the
100d − ( j − 1) remaining choices for which F (r) − G(r) = 0. Hence
d − ( j − 1)
Pr(E j | E1 ∩ E2 ∩ · · · ∩ E j−1 ) ≤ ,
100d − ( j − 1)
and the probability that the algorithm gives the wrong answer after k ≤ d iterations is
bounded by
k
d − ( j − 1) 1 k
Pr(E1 ∩ E2 ∩ · · · ∩ Ek ) ≤ ≤ .
j=1
100d − ( j − 1) 100

Because (d − ( j − 1))/(100d − ( j − 1)) < d/100d when j > 1, our bounds on the
probability of making an error are actually slightly better without replacement. You
may also notice that, if we take d + 1 samples without replacement and the two poly-
nomials are not equivalent, then we are guaranteed to find an r such that F (r) − G(r) =
0. Thus, in d + 1 iterations we are guaranteed to output the correct answer. However,
computing the value of the polynomial at d + 1 points takes (d 2 ) time using the stan-
dard approach, which is no faster than finding the canonical form deterministically.
Since sampling without replacement appears to give better bounds on the probability
of error, why would we ever want to consider sampling with replacement? In some
cases, sampling with replacement is significantly easier to analyze, so it may be worth
7
events and probability

considering for theoretical reasons. In practice, sampling with replacement is often

simpler to code and the effect on the probability of making an error is almost negligible,
making it a desirable alternative.

1.3. Application: Verifying Matrix Multiplication

We now consider another example where randomness can be used to verify an equality
more quickly than the known deterministic algorithms. Suppose we are given three
n × n matrices A, B, and C. For convenience, assume we are working over the integers
modulo 2. We want to verify whether
AB = C.
One way to accomplish this is to multiply A and B and compare the result to C. The sim-
ple matrix multiplication algorithm takes (n3 ) operations. There exist more sophisti-
cated algorithms that are known to take roughly (n2.37 ) operations.
Once again, we use a randomized algorithm that allows for faster verification – at the
expense of possibly returning a wrong answer with small probability. The algorithm is
similar in spirit to our randomized algorithm for checking polynomial identities. The
algorithm chooses a random vector r̄ = (r1 , r2 , . . . , rn ) ∈ {0, 1}n . It then computes ABr̄
by first computing Br̄ and then A(Br̄), and it also computes Cr̄. If A(Br̄) = Cr̄, then
AB = C. Otherwise, it returns that AB = C.
The algorithm requires three matrix-vector multiplications, which can be done in
time (n2 ) in the obvious way. The probability that the algorithm returns that AB = C
when they are actually not equal is bounded by the following theorem.
Theorem 1.4: If AB = C and if r̄ is chosen uniformly at random from {0, 1}n , then
1
Pr(ABr̄ = Cr̄) ≤ .
2
Proof: Before beginning, we point out that the sample space for the vector r̄ is the set
{0, 1}n and that the event under consideration is ABr̄ = Cr̄. We also make note of the
following simple but useful lemma.
Lemma 1.5: Choosing r̄ = (r1 , r2 , . . . , rn ) ∈ {0, 1}n uniformly at random is equiva-
lent to choosing each ri independently and uniformly from {0, 1}.
Proof: If each ri is chosen independently and uniformly at random, then each of the
2n possible vectors r̄ is chosen with probability 2−n , giving the lemma.

Let D = AB − C = 0. Then ABr̄ = Cr̄ implies that Dr̄ = 0. Since D = 0 it must have
some nonzero entry; without loss of generality, let that entry be d11 .
For Dr̄ = 0, it must be the case that

n
d1 j r j = 0
j=1

8
1.3 application: verifying matrix multiplication

or, equivalently,
n
j=2 d1 j r j
r1 = − . (1.1)
d11
Now we introduce a helpful idea. Instead of reasoning about the vector r̄, suppose
that we choose the rk independently and uniformly at random from {0, 1} in order,
from rn down to r1 . Lemma 1.5 says that choosing the rk in this way is equivalent to
choosing a vector r̄ uniformly at random. Now consider the situation just before r1 is
chosen. At this point, the right-hand side of Eqn. (1.1) is determined, and there is at
most one choice for r1 that will make that equality hold. Since there are two choices
for r1 , the equality holds with probability at most 1/2, and hence the probability that
ABr̄ = Cr̄ is at most 1/2. By considering all variables besides r1 as having been set, we
have reduced the sample space to the set of two values {0, 1} for r1 and have changed
the event being considered to whether Eqn. (1.1) holds.

This idea is called the principle of deferred decisions. When there are several random
variables, such as the ri of the vector r̄, it often helps to think of some of them as being
set at one point in the algorithm with the rest of them being left random – or deferred –
until some further point in the analysis. Formally, this corresponds to conditioning on
the revealed values; when some of the random variables are revealed, we must condition
on the revealed values for the rest of the analysis. We will see further examples of the
principle of deferred decisions later in the book.
To formalize this argument, we first introduce a simple fact, known as the law of
total probability.

Theorem 1.6 [Law of Total Probability]:

n Let E1 , E2 , . . . , En be mutually disjoint
events in the sample space , and let i=1 Ei = . Then

n
n
Pr(B) = Pr(B ∩ Ei ) = Pr(B | Ei ) Pr(Ei ).
i=1 i=1

Proof: Since the events Ei (i = 1, . . . , n) are disjoint and cover the entire sample space
, it follows that

n
Pr(B) = Pr(B ∩ Ei ).
i=1

Further,

n
n
Pr(B ∩ Ei ) = Pr(B | Ei ) Pr(Ei )
i=1 i=1

by the definition of conditional probability.

9
events and probability

Now, using this law and summing over all collections of values (x2 , x3 , x4 , . . . , xn ) ∈
{0, 1}n−1 yields
Pr(ABr̄ = Cr̄)

= Pr (ABr̄ = Cr̄) ∩ ((r2 , . . . , rn ) = (x2 , . . . , xn ))
(x2 ,...,xn )∈{0,1}n−1
n
j=2 d1 j r j
≤ Pr r1 = − ∩ ((r2 , . . . , rn ) = (x2 , . . . , xn ))
d11
(x2 ,...,xn )∈{0,1}n−1
n
j=2 d1 j r j
= Pr r1 = − · Pr((r2 , . . . , rn ) = (x2 , . . . , xn ))
d11
(x2 ,...,xn )∈{0,1}n−1
1
≤ Pr((r2 , . . . , rn ) = (x2 , . . . , xn ))
2
(x2 ,...,xn )∈{0,1}n−1
1
= .
2
Here we have used the independence of r1 and (r2 , . . . , rn ) in the fourth line.
To improve on the error probability of Theorem 1.4, we can again use the fact that
the algorithm has a one-sided error and run the algorithm multiple times. If we ever
find an r̄ such that ABr̄ = Cr̄, then the algorithm will correctly return that AB = C. If
we always find ABr̄ = Cr̄, then the algorithm returns that AB = C and there is some
probability of a mistake. Choosing r̄ with replacement from {0, 1}n for each trial, we
obtain that, after k trials, the probability of error is at most 2−k . Repeated trials increase
the running time to (kn2 ).
Suppose we attempt this verification 100 times. The running time of the random-
ized checking algorithm is still (n2 ), which is faster than the known deterministic
algorithms for matrix multiplication for sufficiently large n. The probability that an
incorrect algorithm passes the verification test 100 times is at most 2−100 , an astronom-
ically small number. In practice, the computer is much more likely to crash during the
execution of the algorithm than to return a wrong answer.
An interesting related problem is to evaluate the gradual change in our confidence in
the correctness of the matrix multiplication as we repeat the randomized test. Toward
that end we introduce Bayes’ law.
Theorem 1.7 [Bayes’ Law]: Assume
that E1 , E2 , . . . , En are mutually disjoint events
in the sample space such that ni=1 Ei = . Then
Pr(E j ∩ B) Pr(B | E j ) Pr(E j )
Pr(E j | B) = = n .
Pr(B) i=1 Pr(B | Ei ) Pr(Ei )

As a simple application of Bayes’ law, consider the following problem. We are given
three coins and are told that two of the coins are fair and the third coin is biased, landing
heads with probability 2/3. We are not told which of the three coins is biased. We
permute the coins randomly, and then flip each of the coins. The first and second coins
come up heads, and the third comes up tails. What is the probability that the first coin
is the biased one?
10
1.3 application: verifying matrix multiplication

The coins are in a random order and so, before our observing the outcomes of the
coin flips, each of the three coins is equally likely to be the biased one. Let Ei be the
event that the ith coin flipped is the biased one, and let B be the event that the three coin
flips came up heads, heads, and tails.
Before we flip the coins we have Pr(Ei ) = 1/3 for all i. We can also compute the
probability of the event B conditioned on Ei :

2 1 1 1
Pr(B | E1 ) = Pr(B | E2 ) = · · = ,
3 2 2 6
and
1 1 1 1
Pr(B | E3 ) = · · = .
2 2 3 12
Applying Bayes’ law, we have

Pr(B | E1 ) Pr(E1 ) 2
Pr(E1 | B) = 3 = .
i=1 Pr(B | Ei ) Pr(Ei )
5

Thus, the outcome of the three coin flips increases the likelihood that the first coin is
the biased one from 1/3 to 2/5.
Returning now to our randomized matrix multiplication test, we want to evaluate
the increase in confidence in the matrix identity obtained through repeated tests. In the
Bayesian approach one starts with a prior model, giving some initial value to the model
parameters. This model is then modified, by incorporating new observations, to obtain
a posterior model that captures the new information.
In the matrix multiplication case, if we have no information about the process that
generated the identity then a reasonable prior assumption is that the identity is correct
with probability 1/2. If we run the randomized test once and it returns that the matrix
identity is correct, how does this change our confidence in the identity?
Let E be the event that the identity is correct, and let B be the event that the test
returns that the identity is correct. We start with Pr(E ) = Pr(Ē ) = 1/2, and since the
test has a one-sided error bounded by 1/2, we have Pr(B | E ) = 1 and Pr(B | Ē ) ≤ 1/2.
Applying Bayes’ law yields

Pr(B | E ) Pr(E ) 1/2 2

Pr(E | B) = ≥ = .
Pr(B | E ) Pr(E ) + Pr(B | Ē ) Pr(Ē ) 1/2 + 1/2 · 1/2 3

Assume now that we run the randomized test again and it again returns that the
identity is correct. After the first test, I may naturally have revised my prior model, so
that I believe Pr(E ) ≥ 2/3 and Pr(Ē ) ≤ 1/3. Now let B be the event that the new test
returns that the identity is correct; since the tests are independent, as before we have
Pr(B | E ) = 1 and Pr(B | Ē ) ≤ 1/2. Applying Bayes’ law then yields

2/3 4
Pr(E | B) ≥ = .
2/3 + 1/3 · 1/2 5

11
events and probability

In general: if our prior model (before running the test) is that Pr(E ) ≥ 2i /(2i + 1)
and if the test returns that the identity is correct (event B), then
2i
2i + 1 2i+1 1
Pr(E | B) ≥ = i+1 + 1
= 1 − i+1 .
2i
1 1 2 2 +1
+
2i + 1 2 2i + 1
Thus, if all 100 calls to the matrix identity test return that the identity is correct, our
confidence in the correctness of this identity is at least 1 − 1/(2101 + 1).

1.4. Application: Naïve Bayesian Classifier

A naïve Bayesian classifier is a supervised learning algorithm that classifies objects by

estimating conditional probabilities using Bayes’ law in a simplified (“naïve”) prob-
abilistic model. While the independence assumptions that would justify the approach
are significant oversimplifications, this method proves very effective in many practical
applications such as subject classification of text documents and junk e-mail filtering. It
also provides an example of a deterministic algorithm that is based on the probabilistic
concept of conditional probability.
Assume that we are given a collection of n training examples

{(D1 , c(D1 )), (D2 , c(D2 )), . . . , (Dn , c(Dn ))},

where each Di is represented as a features vector xi = (x1i , . . . , xm i

). Here Di is an object,
such as a text document, and an object has features (X1 , X2 , . . . , Xm ), where feature X j
can take a value from a set of possibilities Fj . By xi = (x1i , . . . , xm i
) we mean that for
Di we have X1 = x1 , . . . , Xm = xm . For example, if Di is a text document, and we have
i i

a list of important keywords, the X j could be Boolean features where xij = 1 if the
jth listed keyword appears in Di and xij = 0 otherwise. In this case, the feature vector
of the document would just correspond to the set of keywords it contains. Finally, we
have a set C = {c1 , c2 , . . . , ct } of possible classifications for the object, and c(Di ) is the
classification of Di . For example, the classification set C could be a collection of labels
such as {“spam”, “no-spam”}. Given a document, corresponding to a web page or e-
mail, we might want to classify the document according to the keywords the document
contains.
The classification paradigm assumes that the training set is a sample from an
unknown distribution in which the classification of an object is a function of the m
features. The goal is, given a new document, to return an accurate classification. More
generally, we can instead return a vector (z1 , z2 , . . . , zt ), where z j is an estimate of the
probability that c(Di ) = c j based on the training set. If we want to return just the most
likely classification, we can return the c j with the highest value of z j .
Suppose to begin that we had a very, very large training set. Then for each vec-
tor y = (y1 , . . . , ym ) and each classification c j , we could use the training set to com-
pute the empirical conditional probability that an object with a features vector y is
12
1.4 application: naïve bayesian classifier

classified C j :
{|i : xi = y, c(Di ) = c j |}
py, j = .
{|i : xi = y)|}
Assuming that a new object D∗ with a features vector x∗ has the same distribution as
the training set, then px∗ , j is an empirical estimate for the conditional probability
Pr(c(D∗ ) = c j | x∗ = (x1∗ , . . . , xm
∗
)).
Indeed, we could compute these values ahead of time in a large lookup table and simply
return the vector (z1 , z2 , . . . , zt ) = (px∗ ,1 , px∗ ,2 , . . . , px∗ ,t ) after computing the features
vector x∗ from the object.
The difficulty in this approach is that we need to obtain accurate estimates of a large
collection of conditional probabilities, corresponding to all possible combination of
values of the m features. Even if each feature has just two values we would need to esti-
mate 2m conditional probabilities per class, which would generally require (|C|2m )
samples.
The training process is faster and requires significantly fewer examples if we assume
a “naïve” model in which the m features are independent. In that case we have for
Pr(x∗ | c(D∗ ) = c j ) · Pr(c(D∗ ) = c j )
Pr(c(D∗ ) = c j | x∗ ) = (1.2)
Pr(x∗ )
m
Pr(xk∗ = xi | c(D∗ ) = c j ) · Pr(c(D∗ ) = c j )
= k=1 . (1.3)
Pr(x∗ )
Here xk∗ is the kth component of the features vector x∗ of object D∗ . Notice that the
denominator is independent of c j , and can be treated as just a normalizing constant
factor.
With a constant number of possible values per feature, we only need to learn esti-
mates for O(m|C|) probabilities. In what follows, we use Pr ˆ to denote empirical prob-
abilities, which are the relative frequency of events in our training set of examples. This
notation emphasizes that we are taking estimates of these probabilities as determined
from the training set. (In practice, one often makes slight modifications, such as adding
1/2 to the numerator in each of the fractions to guarantee that no empirical probability
equals 0.)
The training process is simple:
r For each classification class c j , keep track of the fraction of objects classified as c j
to compute
∗ |{i | c(Di ) = c j }|
ˆ
Pr(c(D ) = cj) = ,
|D|
where |D| is the number of objects in the training set.
r For each feature Xk and feature value xk keep track of the fraction of objects with that
feature value that are classified as c j , to compute

ˆ k∗ = xk | c(D∗ ) = c j ) = |{i : xk = xk , c(Di ) = c j }| .

i
Pr(x
{i | c(Di ) = c j }|
13
events and probability

Naïve Bayes Classifier Algorithm

Input: Set of possible classifications C, set of features and feature values

F1 , . . . , Fm , and a training set of classified items D.
Training Phase:
1. For each category c ∈ C, feature k = 1, . . . , m, and feature value xk ∈ Fk com-
pute

ˆ k∗ = xk | c(D∗ ) = c) = |{i : xk = xk , c(Di ) = c}| .

i
Pr(x
{i | c(Di ) = c}|
2. For each category c ∈ C, compute
∗ |{i | c(Di ) = c}|
ˆ
Pr(c(D ) = c) = .
|D|
Classifying a new item D∗ :
1. To compute the most likely classification for x∗ = x = (x1 , . . . , xm )
m

c(D∗ ) = arg max ˆ k∗ = xk | c(D∗ ) = c j ) Pr(c(D
Pr(x ˆ ∗
) = c j ).
c j ∈C
k=1

2. To compute a classification distribution:

m
ˆ ∗ = xk | c(D∗ ) = c j ) Pr(c(D
Pr(x ˆ ∗
) = cj)
∗ k=1 k
ˆ
Pr(c(D ) = cj) = .
∗
Pr(x = x)
ˆ

Algorithm 1.1: Naïve Bayes Classifier.

Once we train the classifier, the classification of a new object D∗ with features vector
x∗ = (x1∗ , . . . , xm
∗
) is computed by calculating
m

∗ ∗
ˆ k = xk | c(D ) = c j ) Pr(c(D∗
Pr(x ˆ ) = cj)
k=1

for each c j and taking the classification with the highest value.
In practice, the products may lead to underflow values; an easy solution to that prob-
lem is to instead compute the logarithm of the above expression. Estimates of the
entire probability vector can be found by normalizing appropriately. (Alternatively,
instead of normalizing, one could provide probability estimates by also computing
estimates for Pr(x∗ = x) from the sample data. Under our independence assumption
Pr(x∗ = (x1∗ , . . . , xm
∗
)) = m ∗
k=1 Pr(xk = xk ), and one could estimate the denominator
of Equation 1.2 with the product of the corresponding estimates.)
The naïve Bayesian classifier is efficient and simple to implement due to the “naïve”
assumption of independence. This assumption may lead to misleading outcomes when
the classification depends on combinations of features. As a simple example consider

14
1.5 application: a randomized min-cut algorithm

a collection of items characterized by two Boolean features X and Y. If X = Y the item

is in class A, and otherwise it is in class B. Assume further that for each value of X
and Y the training set has an equal number of items in each class. All the conditional
probabilities computed by the classifier equal 0.5, and therefore the classifier is not
better than a coin flip in this example. In practice such phenomena are rare and the
naïve Bayesian classifier is often very effective.

1.5. Application: A Randomized Min-Cut Algorithm

A cut-set in a graph is a set of edges whose removal breaks the graph into two or
more connected components. Given a graph G = (V, E ) with n vertices, the minimum
cut – or min-cut – problem is to find a minimum cardinality cut-set in G. Minimum
cut problems arise in many contexts, including the study of network reliability. In the
case where nodes correspond to machines in the network and edges correspond to con-
nections between machines, the min-cut is the smallest number of edges that can fail
before some pair of machines cannot communicate. Minimum cuts also arise in clus-
tering problems. For example, if nodes represent Web pages (or any documents in a
hypertext-based system) and two nodes have an edge between them if the correspond-
ing nodes have a hyperlink between them, then small cuts divide the graph into clus-
ters of documents with few links between clusters. Documents in different clusters are
likely to be unrelated.
We shall proceed by making use of the definitions and techniques presented so far
in order to analyze a simple randomized algorithm for the min-cut problem. The main
operation in the algorithm is edge contraction. In contracting an edge (u, v) we merge
the two vertices u and v into one vertex, eliminate all edges connecting u and v, and
retain all other edges in the graph. The new graph may have parallel edges but no
self-loops. Examples appear in Figure 1.1, where in each step the dark edge is being
contracted.
The algorithm consists of n − 2 iterations. In each iteration, the algorithm picks an
edge from the existing edges in the graph and contracts that edge. There are many pos-
sible ways one could choose the edge at each step. Our randomized algorithm chooses
the edge uniformly at random from the remaining edges.
Each iteration reduces the number of vertices in the graph by one. After n − 2 iter-
ations, the graph consists of two vertices. The algorithm outputs the set of edges con-
necting the two remaining vertices.
It is easy to verify that any cut-set of a graph in an intermediate iteration of the
algorithm is also a cut-set of the original graph. On the other hand, not every cut-set of
the original graph is a cut-set of a graph in an intermediate iteration, since some edges
of the cut-set may have been contracted in previous iterations. As a result, the output of
the algorithm is always a cut-set of the original graph but not necessarily the minimum
cardinality cut-set (see Figure 1.1).
We now establish a lower bound on the probability that the algorithm returns a cor-
rect output.

15
events and probability

1 3 1

5 5 5 5
3,4 1,3,4 1,2,3,4
2 4 2 2
(a) A successful run of min-cut.
1 3 1 1 1

5 5
3,4 3,4,5
2,3,4,5
2 4 2 2
(b) An unsuccessful run of min-cut.

Figure 1.1: An example of two executions of min-cut in a graph with minimum cut-set of size 2.

Theorem 1.8: The algorithm outputs a min-cut set with probability at least
2/(n(n − 1)).

Proof: Let k be the size of the min-cut set of G. The graph may have several cut-sets
of minimum size. We compute the probability of finding one specific such set C.
Since C is a cut-set in the graph, removal of the set C partitions the set of vertices into
two sets, S and V − S, such that there are no edges connecting vertices in S to vertices in
V − S. Assume that, throughout an execution of the algorithm, we contract only edges
that connect two vertices in S or two vertices in V − S, but not edges in C. In that case,
all the edges eliminated throughout the execution will be edges connecting vertices in
S or vertices in V − S, and after n − 2 iterations the algorithm returns a graph with two
vertices connected by the edges in C. We may therefore conclude that, if the algorithm
never chooses an edge of C in its n − 2 iterations, then the algorithm returns C as the
minimum cut-set.
This argument gives some intuition for why we choose the edge at each iteration
uniformly at random from the remaining existing edges. If the size of the cut C is
small and if the algorithm chooses the edge uniformly at each step, then the probability
that the algorithm chooses an edge of C is small – at least when the number of edges
remaining is large compared to C.
iLet Ei be the event that the edge contracted in iteration i is not in C, and let Fi =
j=1 E j be the event that no edge of C was contracted in the first i iterations. We need
to compute Pr(Fn−2 ).
We start by computing Pr(E1 ) = Pr(F1 ). Since the minimum cut-set has k edges, all
vertices in the graph must have degree k or larger. If each vertex is adjacent to at least k
edges, then the graph must have at least nk/2 edges. The first contracted edge is chosen
uniformly at random from the set of all edges. Since there are at least nk/2 edges in the
graph and since C has k edges, the probability that we do not choose an edge of C in
the first iteration is given by
2k 2
Pr(E1 ) = Pr(F1 ) ≥ 1 − =1− .
nk n
16
1.6 exercises

Let us suppose that the first contraction did not eliminate an edge of C. In other
words, we condition on the event F1 . Then, after the first iteration, we are left with an
(n − 1)-node graph with minimum cut-set of size k. Again, the degree of each vertex
in the graph must be at least k, and the graph must have at least k(n − 1)/2 edges.
Thus,
k 2
Pr(E2 | F1 ) ≥ 1 − =1− .
k(n − 1)/2 n−1
Similarly,
k 2
Pr(Ei | Fi−1 ) ≥ 1 − =1− .
k(n − i + 1)/2 n−i+1
To compute Pr(Fn−2 ), we use
Pr(Fn−2 ) = Pr(En−2 ∩ Fn−3 ) = Pr(En−2 | Fn−3 ) · Pr(Fn−3 )
= Pr(En−2 | Fn−3 ) · Pr(En−3 | Fn−4 ) · · · Pr(E2 | F1 ) · Pr(F1 )
n−2
n−2
2 n−i−1
≥ 1− =
i=1
n−i+1 i=1
n−i+1

n−2 n−3 n−4 4 3 2 1
= ...
n n−1 n−2 6 5 4 3
2
= .
n(n − 1)
Since the algorithm has a one-sided error, we can reduce the error probability by repeat-
ing the algorithm. Assume that we run the randomized min-cut algorithm n(n − 1) ln n
times and output the minimum size cut-set found in all the iterations. The probability
that the output is not a min-cut set is bounded by
n(n−1) ln n
2 1
1− ≤ e−2 ln n = 2 .
n(n − 1) n
In the first inequality we have used the fact that 1 − x ≤ e−x .

1.6. Exercises

Exercise 1.1: We flip a fair coin ten times. Find the probability of the following events.
(a) The number of heads and the number of tails are equal.
(b) There are more heads than tails.
(c) The ith flip and the (11 − i)th flip are the same for i = 1, . . . , 5.
(d) We flip at least four consecutive heads.

Exercise 1.2: We roll two standard six-sided dice. Find the probability of the following
events, assuming that the outcomes of the rolls are independent.
(a) The two dice show the same number.
(b) The number that appears on the first die is larger than the number on the second.
17
events and probability

(c) The sum of the dice is even.

(d) The product of the dice is a perfect square.

Exercise 1.3: We shuffle a standard deck of cards, obtaining a permutation that

is uniform over all 52! possible permutations. Find the probability of the following
events.
(a) The first two cards include at least one ace.
(b) The first five cards include at least one ace.
(c) The first two cards are a pair of the same rank.
(d) The first five cards are all diamonds.
(e) The first five cards form a full house (three of one rank and two of another rank).

Exercise 1.4: We are playing a tournament in which we stop as soon as one of us wins
n games. We are evenly matched, so each of us wins any game with probability 1/2,
independently of other games. What is the probability that the loser has won k games
when the match is over?

Exercise 1.5: After lunch one day, Alice suggests to Bob the following method to
determine who pays. Alice pulls three six-sided dice from her pocket. These dice are
not the standard dice, but have the following numbers on their faces:
r die A – 1, 1, 6, 6, 8, 8;
r die B – 2, 2, 4, 4, 9, 9;
r die C – 3, 3, 5, 5, 7, 7.

The dice are fair, so each side comes up with equal probability. Alice explains that she
and Bob will each pick up one of the dice. They will each roll their die, and the one
who rolls the lowest number loses and will buy lunch. So as to take no advantage, Alice
offers Bob the first choice of the dice.
(a) Suppose that Bob chooses die A and Alice chooses die B. Write out all of the
possible events and their probabilities, and show that the probability that Alice
wins is greater than 1/2.
(b) Suppose that Bob chooses die B and Alice chooses die C. Write out all of the
possible events and their probabilities, and show that the probability that Alice
wins is greater than 1/2.
(c) Since die A and die B lead to situations in Alice’s favor, it would seem that Bob
should choose die C. Suppose that Bob does choose die C and Alice chooses die
A. Write out all of the possible events and their probabilities, and show that the
probability that Alice wins is still greater than 1/2.

Exercise 1.6: Consider the following balls-and-bin game. We start with one black ball
and one white ball in a bin. We repeatedly do the following: choose one ball from the
bin uniformly at random, and then put the ball back in the bin with another ball of the
same color. We repeat until there are n balls in the bin. Show that the number of white
balls is equally likely to be any number between 1 and n − 1.
18
1.6 exercises

Exercise 1.7: (a) Prove Lemma 3, the inclusion–exclusion principle.

(b) Prove that, when is odd,
n
n
Pr Ei ≤ Pr(Ei ) − Pr(Ei ∩ E j ) + Pr(Ei ∩ E j ∩ Ek )
i=1 i=1 i< j i< j<k

+1
− · · · + (−1) Pr(Ei1 ∩ · · · ∩ Ei ).
i1 <i2 <···<i

(c) Prove that, when is even,

n
n
Pr Ei ≥ Pr(Ei ) − Pr(Ei ∩ E j ) + Pr(Ei ∩ E j ∩ Ek )
i=1 i=1 i< j i< j<k

+1
− · · · + (−1) Pr(Ei1 ∩ · · · ∩ Ei ).
i1 <i2 <···<i

Exercise 1.8: I choose a number uniformly at random from the range [1, 1,000,000].
Using the inclusion–exclusion principle, determine the probability that the number cho-
sen is divisible by one or more of 4, 6, and 9.

Exercise 1.9: Suppose that a fair coin is flipped n times. For k > 0, find an upper
bound on the probability that there is a sequence of log2 n + k consecutive heads.

Exercise 1.10: I have a fair coin and a two-headed coin. I choose one of the two coins
randomly with equal probability and flip it. Given that the flip was heads, what is the
probability that I flipped the two-headed coin?

Exercise 1.11: I am trying to send you a single bit, either a 0 or a 1. When I transmit
the bit, it goes through a series of n relays before it arrives to you. Each relay flips the
bit independently with probability p.
(a) Argue that the probability you receive the correct bit is
n/2
n 2k
p (1 − p)n−2k .
k=0
2k
(b) We consider an alternative way to calculate this probability. Let us say the relay
has bias q if the probability it flips the bit is (1 − q)/2. The bias q is therefore a
real number in the range [−1, 1]. Prove that sending a bit through two relays with
bias q1 and q2 is equivalent to sending a bit through a single relay with bias q1 q2 .
(c) Prove that the probability you receive the correct bit when it passes through n relays
as described before (a) is
1 + (1 − 2p)n
.
2
Exercise 1.12: The following problem is known as the Monty Hall problem, after
the host of the game show “Let’s Make a Deal”. There are three curtains. Behind one
curtain is a new car, and behind the other two are goats. The game is played as follows.
19
events and probability

The contestant chooses the curtain that she thinks the car is behind. Monty then opens
one of the other curtains to show a goat. (Monty may have more than one goat to choose
from; in this case, assume he chooses which goat to show uniformly at random.) The
contestant can then stay with the curtain she originally chose or switch to the other
unopened curtain. After that, the location of the car is revealed, and the contestant wins
the car or the remaining goat. Should the contestant switch curtains or not, or does it
make no difference?

Exercise 1.13: A medical company touts its new test for a certain genetic disorder.
The false negative rate is small: if you have the disorder, the probability that the test
returns a positive result is 0.999. The false positive rate is also small: if you do not
have the disorder, the probability that the test returns a positive result is only 0.005.
Assume that 2% of the population has the disorder. If a person chosen uniformly from
the population is tested and the result comes back positive, what is the probability that
the person has the disorder?

Exercise 1.14: I am playing in a racquetball tournament, and I am up against a player

I have watched but never played before. I consider three possibilities for my prior
model: we are equally talented, and each of us is equally likely to win each game;
I am slightly better, and therefore I win each game independently with probability 0.6;
or he is slightly better, and thus he wins each game independently with probability 0.6.
Before we play, I think that each of these three possibilities is equally likely.
In our match we play until one player wins three games. I win the second game, but
he wins the first, third, and fourth. After this match, in my posterior model, with what
probability should I believe that my opponent is slightly better than I am?

Exercise 1.15: Suppose that we roll ten standard six-sided dice. What is the probabil-
ity that their sum will be divisible by 6, assuming that the rolls are independent? (Hint:
Use the principle of deferred decisions, and consider the situation after rolling all but
one of the dice.)

Exercise 1.16: Consider the following game, played with three standard six-sided
dice. If the player ends with all three dice showing the same number, she wins. The
player starts by rolling all three dice. After this first roll, the player can select any one,
two, or all of the three dice and re-roll them. After this second roll, the player can
again select any of the three dice and re-roll them one final time. For questions (a)–(d),
assume that the player uses the following optimal strategy: if all three dice match, the
player stops and wins; if two dice match, the player re-rolls the die that does not match;
and if no dice match, the player re-rolls them all.
(a) Find the probability that all three dice show the same number on the first roll.
(b) Find the probability that exactly two of the three dice show the same number on
the first roll.
(c) Find the probability that the player wins, conditioned on exactly two of the three
dice showing the same number on the first roll.
20
1.6 exercises

(d) By considering all possible sequences of rolls, find the probability that the player
wins the game.

Exercise 1.17: In our matrix multiplication algorithm, we worked over the integers
modulo 2. Explain how the analysis would change if we worked over the integers mod-
ulo k for k > 2.

Exercise 1.18: We have a function F : {0, . . . , n − 1} → {0, . . . , m − 1}. We

know that, for 0 ≤ x, y ≤ n − 1, F ((x + y) mod n) = (F (x) + F (y)) mod m. The only
way we have for evaluating F is to use a lookup table that stores the values of F. Unfor-
tunately, an Evil Adversary has changed the value of 1/5 of the table entries when we
were not looking.
Describe a simple randomized algorithm that, given an input z, outputs a value that
equals F (z) with probability at least 1/2. Your algorithm should work for every value
of z, regardless of what values the Adversary changed. Your algorithm should use as
few lookups and as little computation as possible.
Suppose I allow you to repeat your initial algorithm three times. What should you
do in this case, and what is the probability that your enhanced algorithm returns the
correct answer?

Exercise 1.19: Give examples of events where Pr(A | B) < Pr(A), Pr(A | B) = Pr(A),
and Pr(A | B) > Pr(A).

Exercise 1.20: Show that, if E1 , E2 , . . . , En are mutually independent, then so are

Ē1 , Ē2 , . . . , Ēn .

Exercise 1.21: Give an example of three random events X, Y, Z for which any pair are
independent but all three are not mutually independent.

Exercise 1.22: (a) Consider the set {1, . . . , n}. We generate a subset X of this set as
follows: a fair coin is flipped independently for each element of the set; if the coin lands
heads then the element is added to X, and otherwise it is not. Argue that the resulting
set X is equally likely to be any one of the 2n possible subsets.
(b) Suppose that two sets X and Y are chosen independently and uniformly at
random from all the 2n subsets of {1, . . . , n}. Determine Pr(X ⊆ Y ) and Pr(X ∪ Y =
{1, . . . , n}). (Hint: Use the part (a) of this problem.)

Exercise 1.23: There may be several different min-cut sets in a graph. Using the
analysis of the randomized min-cut algorithm, argue that there can be at most
n(n − 1)/2 distinct min-cut sets.

Exercise 1.24: Generalizing on the notion of a cut-set, we define an r-way cut-set in a

graph as a set of edges whose removal breaks the graph into r or more connected com-
ponents. Explain how the randomized min-cut algorithm can be used to find minimum
r-way cut-sets, and bound the probability that it succeeds in one iteration.
21
events and probability

Exercise 1.25: To improve the probability of success of the randomized min-cut algo-
rithm, it can be run multiple times.
(a) Consider running the algorithm twice. Determine the number of edge contractions
and bound the probability of finding a min-cut.
(b) Consider the following variation. Starting with a graph with n vertices, first con-
tract the graph down to k vertices using the randomized min-cut algorithm. Make
copies of the graph with k vertices, and now run the randomized algorithm on this
reduced graph times, independently. Determine the number of edge contractions
and bound the probability of finding a minimum cut.
(c) Find optimal (or at least near-optimal) values of k and for the variation in (b) that
maximize the probability of finding a minimum cut while using the same number
of edge contractions as running the original algorithm twice.

Exercise 1.26: Tic-tac-toe always ends up in a tie if players play optimally. Instead,
we may consider random variations of tic-tac-toe.
(a) First variation: Each of the nine squares is labeled either X or O according to an
independent and uniform coin flip. If only one of the players has one (or more)
winning tic-tac-toe combinations, that player wins. Otherwise, the game is a tie.
Determine the probability that X wins. (You may want to use a computer program
to help run through the configurations.)
(b) Second variation: X and O take turns, with the X player going first. On the X
player’s turn, an X is placed on a square chosen independently and uniformly at
random from the squares that are still vacant; O plays similarly. The first player to
have a winning tic-tac-toe combination wins the game, and a tie occurs if neither
player achieves a winning combination. Find the probability that each player wins.
(Again, you may want to write a program to help you.)

22
chapter two
Discrete Random Variables
and Expectation

In this chapter, we introduce the concepts of discrete random variables and expectation
and then develop basic techniques for analyzing the expected performance of algo-
rithms. We apply these techniques to computing the expected running time of the well-
known Quicksort algorithm. In analyzing two versions of Quicksort, we demonstrate
the distinction between the analysis of randomized algorithms, where the probability
space is defined by the random choices made by the algorithm, and the probabilistic
analysis of deterministic algorithms, where the probability space is defined by some
probability distribution on the inputs.
Along the way we define the Bernoulli, binomial, and geometric random variables,
study the expected size of a simple branching process, and analyze the expectation of
the coupon collector’s problem – a probabilistic paradigm that reappears throughout
the book.

2.1. Random Variables and Expectation

When studying a random event, we are often interested in some value associated with
the random event rather than in the event itself. For example, in tossing two dice we
are often interested in the sum of the two dice rather than the separate value of each
die. The sample space in tossing two dice consists of 36 events of equal probability,
given by the ordered pairs of numbers {(1, 1), (1, 2), . . . , (6, 5), (6, 6)}. If the quantity
we are interested in is the sum of the two dice, then we are interested in 11 events (of
unequal probability): the 11 possible outcomes of the sum. Any such function from the
sample space to the real numbers is called a random variable.
Definition 2.1: A random variable X on a sample space is a real-valued (mea-
surable) function on ; that is, X : → R. A discrete random variable is a random
variable that takes on only a finite or countably infinite number of values.
Since random variables are functions, they are usually denoted by a capital letter such
as X or Y, while real numbers are usually denoted by lowercase letters.
23
discrete random variables and expectation

For a discrete random variable X and a real value a, the event “X = a” includes all
the basic events of the sample space in which the random variable X assumes the value
a. That is, “X = a” represents the set {s ∈ | X (s) = a}. We denote the probability of
that event by

Pr(X = a) = Pr(s).
s∈: X (s)=a

If X is the random variable representing the sum of the two dice, then the event X = 4
corresponds to the set of basic events {(1, 3), (2, 2), (3, 1)}. Hence
3 1
Pr(X = 4) = = .
36 12
The definition of independence that we developed for events extends to random
variables.
Definition 2.2: Two random variables X and Y are independent if and only if
Pr((X = x) ∩ (Y = y)) = Pr(X = x) · Pr(Y = y)
for all values x and y. Similarly, random variables X1 , X2 , . . . , Xk are mutually inde-
pendent if and only if, for any subset I ⊆ [1, k] and any values xi , i ∈ I,

Pr (Xi = xi ) = Pr(Xi = xi ).
i∈I i∈I

A basic characteristic of a random variable is its expectation, which is also often called
the mean. The expectation of a random variable is a weighted average of the values
it assumes, where each value is weighted by the probability that the variable assumes
that value.
Definition 2.3: The expectation of a discrete random variable X, denoted by E[X], is
given by

E[X] = i Pr(X = i),
i

where
the summation is over all values in the range of X. The expectation is finite if
i |i| Pr(X = i) converges; otherwise, the expectation is unbounded.
For example, the expectation of the random variable X representing the sum of two
dice is
1 2 3 1
E[X] = ·2+ ·3+ · 4 + ··· + · 12 = 7.
36 36 36 36
You may try using symmetry to give simpler argument for why E[X] = 7.
As an example of where the expectation of a discrete random variable is unbounded,
consider a random variable X that takes on the value 2i with probability 1/2i for i =
1, 2, . . . . The expected value of X is
∞ ∞
1 i
E[X] = 2 = 1 = ∞.
i=1
2i i=1

24
2.1 random variables and expectation

Here we use the somewhat informal notation E[X] = ∞ to express that E[X] is
unbounded.

2.1.1. Linearity of Expectations

A key property of expectation that significantly simplifies its computation is the linear-
ity of expectations. By this property, the expectation of the sum of random variables is
equal to the sum of their expectations. Formally, we have the following theorem.

Theorem 2.1 [Linearity of Expectations]: For any finite collection of discrete ran-
dom variables X1 , X2 , . . . , Xn with finite expectations,

n n
E Xi = E[Xi ].
i=1 i=1

Proof: We prove the statement for two random variables X and Y; the general case
follows by induction. The summations that follow are understood to be over the ranges
of the corresponding random variables:

E[X + Y ] = (i + j) Pr((X = i) ∩ (Y = j))
i j

= i Pr((X = i) ∩ (Y = j)) + j Pr((X = i) ∩ (Y = j))
i j i j

= i Pr((X = i) ∩ (Y = j)) + j Pr((X = i) ∩ (Y = j))
i j j i

= i Pr(X = i) + j Pr(Y = j)
i j
= E[X] + E[Y ].

The first equality follows from Definition 1.2. In the penultimate equation we have used
Theorem 1.6, the Law of Total Probability.

We now use this property to compute the expected sum of two standard dice. Let X =
X1 + X2 , where Xi represents the outcome of die i for i = 1, 2. Then

1
6
7
E[Xi ] = j= .
6 j=1 2

Applying the linearity of expectations, we have

E[X] = E[X1 ] + E[X2 ] = 7.

It is worth emphasizing that linearity of expectations holds for any collection of

random variables, even if they are not independent! For example, consider again the
25
discrete random variables and expectation

previous example and let the random variable Y = X1 + X12 . We have

E[Y ] = E X1 + X12 = E[X1 ] + E X12 ,
even though X1 and X12 are clearly dependent. As an exercise, you may verify this
identity by considering the six possible outcomes for X1 .
Linearity of expectations also holds for countably infinite summations in certain
cases. Specifically, it can be shown that
∞ ∞
E Xi = E[Xi ]
i=1 i=1
∞
whenever i=1 E[|Xi |] converges. The issue of dealing with the linearity of expecta-
tions with countably infinite summations is further considered in Exercise 2.29.
This chapter contains several examples in which the linearity of expectations signif-
icantly simplifies the computation of expectations. One result related to the linearity of
expectations is the following simple lemma.
Lemma 2.2: For any constant c and discrete random variable X,
E[cX] = cE[X].
Proof: The lemma is obvious for c = 0. For c = 0,

E[cX] = j Pr(cX = j)
j

=c ( j/c) Pr(X = j/c)
j

=c k Pr(X = k)
k
= cE[X].

2.1.2. Jensen’s Inequality

Suppose that we choose the length X of a side of a square uniformly at random from
the range [1, 99]. What is the expected value of the area? We can write this as E[X 2 ].
It is tempting to think of this as being equal to E[X]2 , but a simple calculation shows
that this is not correct. In fact, E[X]2 = 2500 whereas E[X 2 ] = 9950/3 > 2500.
More generally, we can prove that E[X 2 ] ≥ (E[X])2 . Consider Y = (X − E[X])2 .
The random variable Y is nonnegative and hence its expectation must also be non-
negative. Therefore,
0 ≤ E[Y ] = E[(X − E[X])2 ]
= E[X 2 − 2XE[X] + (E[X])2 ]
= E[X 2 ] − 2E[XE[X]] + (E[X])2
= E[X 2 ] − (E[X])2 .

26
2.2 the bernoulli and binomial random variables

To obtain the penultimate line, we used the linearity of expectations. To obtain the last
line we used Lemma 2.2 to simplify E[XE[X]] = E[X] · E[X].
The fact that E[X 2 ] ≥ (E[X])2 is an example of a more general theorem known as
Jensen’s inequality. Jensen’s inequality shows that, for any convex function f, we have
E[ f (X )] ≥ f (E[X]).

Definition 2.4: A function f : R → R is said to be convex if, for any x1 , x2 and 0 ≤

λ ≤ 1,

f (λx1 + (1 − λ)x2 ) ≤ λ f (x1 ) + (1 − λ) f (x2 ).

Visually, a convex function f has the property that, if you connect two points on the
graph of the function by a straight line, this line lies on or above the graph of the
function. The following fact, which we state without proof, is often a useful alternative
to Definition 2.4.

Lemma 2.3: If f is a twice differentiable function, then f is convex if and only if

f (x) ≥ 0.

Theorem 2.4 [Jensen’s Inequality]: If f is a convex function, then

E[ f (X )] ≥ f (E[X]).

Proof: We prove the theorem assuming that f has a Taylor expansion. Let μ = E[X].
By Taylor’s theorem there is a value c such that

f (c)(x − μ)2
f (x) = f (μ) + f (μ)(x − μ) +
2
≥ f (μ) + f (μ)(x − μ),

since f (c) > 0 by convexity. Taking expectations of both sides and applying linearity
of expectations and Lemma 2.2 yields the result:

E[ f (X )] ≥ E[ f (μ) + f (μ)(X − μ)]

= E[ f (μ)] + f (μ)(E[X] − μ)
= f (μ) = f (E[X]).

An alternative proof of Jensen’s inequality, which holds for any random variable X that
takes on only finitely many values, is presented in Exercise 2.10.

2.2. The Bernoulli and Binomial Random Variables

Suppose that we run an experiment that succeeds with probability p and fails with
probability 1 − p.

27
discrete random variables and expectation

Let Y be a random variable such that

1 if the experiment succeeds,
Y =
0 otherwise.

The variable Y is called a Bernoulli or an indicator random variable. Note that, for a
Bernoulli random variable,

E[Y ] = p · 1 + (1 − p) · 0 = p = Pr(Y = 1).

For example, if we flip a fair coin and consider the outcome “heads” a success, then
the expected value of the corresponding indicator random variable is 1/2.
Consider now a sequence of n independent coin flips. What is the distribution of the
number of heads in the entire sequence? More generally, consider a sequence of n inde-
pendent experiments, each of which succeeds with probability p. If we let X represent
the number of successes in the n experiments, then X has a binomial distribution.

Definition 2.5: A binomial random variable X with parameters n and p, denoted by

B(n, p), is defined by the following probability distribution on j = 0, 1, 2, . . . , n:

n
Pr(X = j) = p j (1 − p)n− j .
j

That is, the binomial random variable X equals j when there are exactly j successes and
n − j failures in n independent experiments, each of which is successful with proba-
bility p.
As an exercise, you should show that Definition 2.5 ensures that nj=0 Pr(X = j) =
1. This is necessary for the binomial random variable to have a valid probability func-
tion, according to Definition 1.2.
The binomial random variable arises in many contexts, especially in sampling. As a
practical example, suppose that we want to gather data about the packets going through
a router by postprocessing them. We might want to know the approximate fraction of
packets from a certain source or of a certain data type. We do not have the memory
available to store all of the packets, so we choose to store a random subset – or sample
– of the packets for later analysis. If each packet is stored with probability p and if n
packets go through the router each day, then the number of sampled packets each day
is a binomial random variable X with parameters n and p. If we want to know how
much memory is necessary for such a sample, a natural starting point is to determine
the expectation of the random variable X.
Sampling in this manner arises in other contexts as well. For example, by sampling
the program counter while a program runs, one can determine what parts of a program
are taking the most time. This knowledge can be used to aid dynamic program opti-
mization techniques such as binary rewriting, where the executable binary form of a
program is modified while the program executes. Since rewriting the executable as the
program runs is expensive, sampling helps the optimizer to determine when it will be
worthwhile.
28
2.3 conditional expectation

What is the expectation of a binomial random variable X? We can compute it directly

from the definition as
n
n
E[X] = j p j (1 − p)n− j
j=0
j

n
n!
= j p j (1 − p)n− j
j=0
j! (n − j)!

n
n!
= p j (1 − p)n− j
j=1
( j − 1)! (n − j)!

n
(n − 1)!
= np p j−1 (1 − p)(n−1)−( j−1)
j=1
( j − 1)! ((n − 1) − ( j − 1))!

n−1
(n − 1)!
= np pk (1 − p)(n−1)−k
k=0
k! ((n − 1) − k)!
n − 1
n−1
= np pk (1 − p)(n−1)−k
k=0
k
= np,

where the last equation uses the binomial identity

n
n
(x + y)n = xk yn−k .
k=0
k

The linearity of expectations allows for a significantly simpler argument. If X is a

binomial random variable with parameters n and p, then X is the number of successes
in n trials, where each trial is successful with probability p. Define a set of n indicator
random variables X1 , . . . , Xn ,
where Xi = 1 if the ith trial is successful and 0 otherwise.
Clearly, E[Xi ] = p and X = ni=1 Xi and so, by the linearity of expectations,

n n
E[X] = E Xi = E[Xi ] = np.
i=1 i=1

The linearity of expectations makes this approach of representing a random variable

by a sum of simpler random variables, such as indicator random variables, extremely
useful.

2.3. Conditional Expectation

Just as we have defined conditional probability, it is useful to define the conditional

expectation of a random variable. The following definition is quite natural.
29
discrete random variables and expectation

Definition 2.6: E[Y | Z = z] = y Pr(Y = y | Z = z),
y

where the summation is over all y in the range of Y.

The definition states that the conditional expectation of a random variable is, like the
expectation, a weighted sum of the values it assumes. The difference is that now each
value is weighted by the conditional probability that the variable assumes that value.
One can similarly define the conditional expectation of a random variable Y conditioned
on an event E as

E[Y | E] = y Pr(Y = y | E ).
y

For example, suppose that we independently roll two standard six-sided dice. Let X1
be the number that shows on the first die, X2 the number on the second die, and X the
sum of the numbers on the two dice. Then

8
1 11
E[X | X1 = 2] = x Pr(X = x | X1 = 2) = x· = .
x x=3
6 2

As another example, consider E[X1 | X = 5]:

4
E[X1 | X = 5] = x Pr(X1 = x | X = 5)
x=1
4
Pr(X1 = x ∩ X = 5) 1/36
4
5
= x = x = .
x=1
Pr(X = 5) x=1
4/36 2

The following natural identity follows from Definition 2.6.

Lemma 2.5: For any random variables X and Y,

E[X] = Pr(Y = y)E[X | Y = y],
y

where the sum is over all values in the range of Y and all of the expectations exist.

Proof: Pr(Y = y)E[X | Y = y] = Pr(Y = y) x Pr(X = x | Y = y)
y y x

= x Pr(X = x | Y = y) Pr(Y = y)
x y

= x Pr(X = x ∩ Y = y)
x y

= x Pr(X = x) = E[X].
x

The linearity of expectations also extends to conditional expectations. This is clarified

in Lemma 2.6, whose proof is left as Exercise 2.11.
30
2.3 conditional expectation

Lemma 2.6: For any finite collection of discrete random variables X1 , X2 , . . . , Xn with
finite expectations and for any random variable Y,
n n
E Xi | Y = y = E[Xi | Y = y].
i=1 i=1

Perhaps somewhat confusingly, the conditional expectation is also used to refer to the
following random variable.
Definition 2.7: The expression E[Y | Z] is a random variable f (Z) that takes on the
value E[Y | Z = z] when Z = z.
We emphasize that E[Y | Z] is not a real value; it is actually a function of the random
variable Z. Hence E[Y | Z] is itself a function from the sample space to the real numbers
and can therefore be thought of as a random variable.
In the previous example of rolling two dice,
1 +6
X
1 7
E[X | X1 ] = x Pr(X = x | X1 ) = x· = X1 + .
x x=X1 +1
6 2

We see that E[X | X1 ] is a random variable whose value depends on X1 .

If E[Y | Z] is a random variable, then it makes sense to consider its expectation
E[E[Y | Z]]. In our example, we found that E[X | X1 ] = X1 + 7/2. Thus

7 7 7
E[E[X | X1 ]] = E X1 + = + = 7 = E[X].
2 2 2
More generally, we have the following theorem.
Theorem 2.7:
E[Y ] = E[E[Y | Z]].
Proof: From Definition 2.7 we have E[Y | Z] = f (Z), where f (Z) takes on the value
E[Y | Z = z] when Z = z. Hence

E[E[Y | Z]] = E[Y | Z = z] Pr(Z = z).
z

The right-hand side equals E[Y ] by Lemma 2.5.

We now demonstrate an interesting application of conditional expectations. Consider

a program that includes one call to a process S. Assume that each call to process S
recursively spawns new copies of the process S, where the number of new copies is
a binomial random variable with parameters n and p. We assume that these random
variables are independent for each call to S. What is the expected number of copies of
the process S generated by the program?
To analyze this recursive spawning process, we introduce the idea of generations.
The initial process S is in generation 0. Otherwise, we say that a process S is in gen-
eration i if it was spawned by another process S in generation i − 1. Let Yi denote
31
discrete random variables and expectation

the number of S processes in generation i. Since we know that Y0 = 1, the number of

processes in generation 1 has a binomial distribution. Thus,
E[Y1 ] = np.
Similarly, suppose we knew that the number of processes in generation i − 1 was
yi−1 , so Yi−1 = yi−1 . Let Zk be the number of copies spawned by the kth process spawned
in the (i − 1)th generation for 1 ≤ k ≤ yi−1 . Each Zk is a binomial random variable with
parameters n and p. Then
yi−1
E[Yi | Yi−1 = yi−1 ] = E Zk | Yi−1 = yi−1
k=1

yi−1
= j Pr Zk = j | Yi−1 = yi−1
j≥0 k=1

yi−1
= j Pr Zk = j
j≥0 k=1

yi−1
=E Zk
k=1

yi−1
= E[Zk ]
k=1
= yi−1 np.
In the third line we have used that the Zk are all independent binomial random variables;
in particular, the value of each Zk is independent of Yi−1 , allowing us to remove the
conditioning. In the fifth line, we have applied the linearity of expectations.
Applying Theorem 2.7, we can compute the expected size of the ith generation
inductively. We have
E[Yi ] = E[E[Yi | Yi−1 ]] = E[Yi−1 np] = npE[Yi−1 ].
By induction on i, and using the fact that Y0 = 1, we then obtain
E[Yi ] = (np)i .
The expected total number of copies of process S generated by the program is
given by

E Yi = E[Yi ] = (np)i .
i≥0 i≥0 i≥0

If np ≥ 1 then the expectation is unbounded; if np < 1, the expectation is 1/(1 − np).

Thus, the expected number of processes generated by the program is bounded if and
only if the expected number of processes spawned by each process is less than 1.
The process analyzed here is a simple example of a branching process, a probabilis-
tic paradigm extensively studied in probability theory.
32
2.4 the geometric distribution

2.4. The Geometric Distribution

Suppose that we flip a coin until it lands on heads. What is the distribution of the number
of flips? This is an example of a geometric distribution, which arises in the following
situation: we perform a sequence of independent trials until the first success, where
each trial succeeds with probability p.

Definition 2.8: A geometric random variable X with parameter p is given by the fol-
lowing probability distribution on n = 1, 2, . . . ,:

Pr(X = n) = (1 − p)n−1 p.

That is, for the geometric random variable X to equal n, there must be n − 1 failures,
followed by a success.
As an exercise, you should show that the geometric random variable satisfies

Pr(X = n) = 1.
n≥1

Again, this is necessary for the geometric random variable to have a valid probability
function, according to Definition 1.2.
In the context of our example from Section 2.2 of sampling packets on a router, if
packets are sampled with probability p, then the number of packets transmitted after the
last sampled packet until and including the next sampled packet is given by a geometric
random variable with parameter p.
Geometric random variables are said to be memoryless because the probability that
you will reach your first success n trials from now is independent of the number of
failures you have experienced. Informally, one can ignore past failures because they do
not change the distribution of the number of future trials until first success. Formally,
we have the following statement.

Lemma 2.8: For a geometric random variable X with parameter p and for n > 0,

Pr(X = n + k | X > k) = Pr(X = n).

Proof: Pr((X = n + k) ∩ (X > k))

Pr(X = n + k | X > k) =
Pr(X > k)
Pr(X = n + k)
=
Pr(X > k)
(1 − p)n+k−1 p
= ∞
i=k (1 − p) p
i

(1 − p)n+k−1 p
=
(1 − p)k
= (1 − p)n−1 p
= Pr(X = n).

The fourth equality uses the fact that, for 0 < x < 1, ∞i=k x = x /(1 − x).
i k

33
discrete random variables and expectation

We now turn to computing the expectation of a geometric random variable. When a

random variable takes values in the set of natural numbers N = {0, 1, 2, 3, . . . }, there
is an alternative formula for calculating its expectation.

Lemma 2.9: Let X be a discrete random variable that takes on only nonnegative inte-
ger values. Then
∞

E[X] = Pr(X ≥ i).
i=1

∞
∞
∞
Proof: Pr(X ≥ i) = Pr(X = j)
i=1 i=1 j=i
∞
j
= Pr(X = j)
j=1 i=1
∞

= j Pr(X = j)
j=1
= E[X].

The interchange of (possibly) infinite summations is justified, since the terms being
summed are all nonnegative.

For a geometric random variable X with parameter p,

∞

Pr(X ≥ i) = (1 − p)n−1 p = (1 − p)i−1 .
n=i

Hence
∞

E[X] = Pr(X ≥ i)
i=1
∞
= (1 − p)i−1
i=1
1
=
1 − (1 − p)
1
= .
p
Thus, for a fair coin where p = 1/2, on average it takes two flips to see the first
heads.
There is another approach to finding the expectation of a geometric random variable
X with parameter p – one that uses conditional expectations and the memoryless prop-
erty of geometric random variables. Recall that X corresponds to the number of flips
until the first heads given that each flip is heads with probability p. Let Y = 0 if the first
34
2.4 the geometric distribution

flip is tails and Y = 1 if the first flip is heads. By the identity from Lemma 2.5,

E[X] = Pr(Y = 0)E[X | Y = 0] + Pr(Y = 1)E[X | Y = 1]

= (1 − p)E[X | Y = 0] + pE[X | Y = 1].

If Y = 1 then X = 1, so E[X | Y = 1] = 1. If Y = 0, then X > 1. In this case, let

the number of remaining flips (after the first flip until the first heads) be Z. Then, by the
linearity of expectations,

E[X] = (1 − p)E[Z + 1] + p · 1 = (1 − p)E[Z] + 1.

By the memoryless property of geometric random variables, Z is also a geometric ran-

dom variable with parameter p. Hence E[Z] = E[X], since they both have the same
distribution. We therefore have

E[X] = (1 − p)E[Z] + 1 = (1 − p)E[X] + 1,

which yields E[X] = 1/p.

This method of using conditional expectations to compute an expectation is often
useful, especially in conjunction with the memoryless property of a geometric random
variable.

2.4.1. Example: Coupon Collector’s Problem

The coupon collector’s problem arises from the following scenario. Suppose that each
box of cereal contains one of n different coupons. Once you obtain one of every type
of coupon, you can send in for a prize. Assuming that the coupon in each box is chosen
independently and uniformly at random from the n possibilities and that you do not
collaborate with others to collect coupons, how many boxes of cereal must you buy
before you obtain at least one of every type of coupon? This simple problem arises in
many different scenarios and will reappear in several places in the book.
Let X be the number of boxes bought until at least one of every type of coupon is
obtained. We now determine E[X]. If Xi is the number of boxes bought while you had
exactly i − 1 different coupons, then clearly X = ni=1 Xi .
The advantage of breaking the random variable X into a sum of n random variables
Xi , i = 1, . . . , n, is that each Xi is a geometric random variable. When exactly i − 1
coupons have been found, the probability of obtaining a new coupon is

i−1
pi = 1 − .
n

Hence, Xi is a geometric random variable with parameter pi , and

1 n
E[Xi ] = = .
pi n−i+1
35
discrete random variables and expectation

Using the linearity of expectations, we have that

n
E[X] = E Xi
i=1

n
= E[Xi ]
i=1
n
n
=
i=1
n−i+1
n
1
=n .
i=1
i
n
The summation i=1 1/i is known as the harmonic number H(n), and as we show
next, H(n) = ln n + (1). Thus, for the coupon collector’s problem, the expected num-
ber of random coupons required to obtain all n coupons is n ln n + (n).

Lemma 2.10: The harmonic number H(n) = ni=1 1/i satisfies H(n) = ln n + (1).

Proof: Since 1/x is monotonically decreasing, we can write

n n
1 1
ln n = dx ≤
x=1 x k=1
k

and

n n
1 1
≤ dx = ln n.
k=2
k x=1 x

This is clarified in Figure 2.1, where the area below the curve f (x) = 1/x corre-
sponds
n to the integral
n and the areas of the shaded regions correspond to the summations
k=1 1/k and k=2 1/k.
Hence ln n ≤ H(n) ≤ ln n + 1, proving the claim.

As a simple application of the coupon collector’s problem, suppose that packets are
sent in a stream from a source host to a destination host along a fixed path of routers.
The host at the destination would like to know which routers the stream of packets has
passed through, in case it finds later that some router damaged packets that it processed.
If there is enough room in the packet header, each router can append its identification
number to the header, giving the path. Unfortunately, there may not be that much room
available in the packet header.
Suppose instead that each packet header has space for exactly one router identi-
fication number, and this space is used to store the identification of a router chosen
uniformly at random from all of the routers on the path. This can actually be accom-
plished easily; we consider how in Exercise 2.18. Then, from the point of view of the
destination host, determining all the routers on the path is like a coupon collector’s
problem. If there are n routers along the path, then the expected number of packets in
36
2.5 application: the expected run-time of quicksort

– –
+ +

– + – +

Figure 2.1: Approximating the area above and below f (x) = 1/x.

the stream that must arrive before the destination host knows all of the routers on the
path is nH(n) = n ln n + (n).

2.5. Application: The Expected Run-Time of Quicksort

Quicksort is a simple – and, in practice, very efficient – sorting algorithm. The input is
a list of n numbers x1 , x2 , . . . , xn . For convenience, we will assume that the numbers
are distinct. A call to the Quicksort function begins by choosing a pivot element from
the set. Let us assume the pivot is x. The algorithm proceeds by comparing every other
element to x, dividing the list of elements into two sublists: those that are less than x
and those that are greater than x. Notice that if the comparisons are performed in the
natural order, from left to right, then the order of the elements in each sublist is the
same as in the initial list. Quicksort then recursively sorts these sublists.
In the worst case, Quicksort requires (n2 ) comparison operations. For example,
suppose our input has the form x1 = n, x2 = n − 1, . . . , xn−1 = 2, xn = 1. Suppose
also that we adopt the rule that the pivot should be the first element of the list. The
first pivot chosen is then n, so Quicksort performs n − 1 comparisons. The division
has yielded one sublist of size 0 (which requires no additional work) and another of
size n − 1, with the order n − 1, n − 2, . . . , 2, 1. The next pivot chosen is n − 1, so
Quicksort performs n − 2 comparisons and is left with one group of size n − 2 in the
order n − 2, n − 3, . . . , 2, 1. Continuing in this fashion, Quicksort performs
n(n − 1)
(n − 1) + (n − 2) + · · · + 2 + 1 = comparisons.
2
This is not the only bad case that leads to (n2 ) comparisons; similarly poor perfor-
mance occurs if the pivot element is chosen from among the smallest few or the largest
few elements each time.
37
discrete random variables and expectation

Quicksort Algorithm:

Input: A list S = {x1 , . . . , xn } of n distinct elements over a totally ordered universe.

Output: The elements of S in sorted order.
1. If S has one or zero elements, return S. Otherwise continue.
2. Choose an element of S as a pivot; call it s.
3. Compare every other element of S to x in order to divide the other elements into
two sublists:
(a) S1 has all the elements of S that are less than x;
(b) S2 has all those that are greater than x.
4. Use Quicksort to sort S1 and S2 .
5. Return the list S1 , x, S2 .

Algorithm 2.1: Quicksort.

We clearly made a bad choice of pivots for the given input. A reasonable choice of
pivots would require many fewer comparisons. For example, if our pivot always splits
the list into two sublists of size at most n/2, then the number of comparisons C(n)
would obey the following recurrence relation:
C(n) ≤ 2C(n/2) + (n).
The solution to this equation yields C(n) = O(n log n), which is the best possible result
for comparison-based sorting. In fact, any sequence of pivot elements that always split
the input list into two sublists each of size at least cn for some constant c would yield
an O(n log n) running time.
This discussion provides some intuition for how we would like pivots to be chosen.
In each iteration of the algorithm there is a good set of pivot elements that split the
input list into two almost equal sublists; it suffices if the sizes of the two sublists are
within a constant factor of each other. There is also a bad set of pivot elements that do
not split up the list significantly. If good pivots are chosen sufficiently often, Quicksort
will terminate quickly. How can we guarantee that the algorithm chooses good pivot
elements sufficiently often? We can resolve this problem in one of two ways.
First, we can change the algorithm to choose the pivots randomly. This makes Quick-
sort a randomized algorithm; the randomization makes it extremely unlikely that we
repeatedly choose the wrong pivots. We demonstrate shortly that the expected number
of comparisons made by a simple randomized Quicksort is 2n ln n + O(n), matching
(up to constant factors) the (n log n) bound for comparison-based sorting. Here, the
expectation is over the random choice of pivots.
A second possibility is that we can keep our deterministic algorithm, using the first
list element as a pivot, but consider a probabilistic model of the inputs. A permutation
of a set of n distinct items is just one of the n! orderings of these items. Instead of
looking for the worst possible input, we assume that the input items are given to us in
a random order. This may be a reasonable assumption for some applications; alterna-
tively, this could be accomplished by ordering the input list according to a randomly
38
2.5 application: the expected run-time of quicksort

chosen permutation before running the deterministic Quicksort algorithm. In this case,
we have a deterministic algorithm but a probabilistic analysis based on a model of the
inputs. We again show in this setting that the expected number of comparisons made
is 2n ln n + O(n). Here, the expectation is over the random choice of inputs.
The same techniques are generally used both in analyses of randomized algorithms
and in probabilistic analyses of deterministic algorithms. Indeed, in this application the
analysis of the randomized Quicksort and the probabilistic analysis of the deterministic
Quicksort under random inputs are essentially the same.
Let us first analyze Random Quicksort, the randomized algorithm version of
Quicksort.

Theorem 2.11: Suppose that, whenever a pivot is chosen for Random Quicksort, it
is chosen independently and uniformly at random from all possibilities. Then, for any
input, the expected number of comparisons made by Random Quicksort is 2n ln n +
O(n).

Proof: Let y1 , y2 , . . . , yn be the same values as the input values x1 , x2 , . . . , xn but sorted
in increasing order. For i < j, let Xi j be a random variable that takes on the value 1 if
yi and y j are compared at any time over the course of the algorithm, and 0 otherwise.
Then the total number of comparisons X satisfies

n−1
n
X= Xi j ,
i=1 j=i+1

and

n−1
n
E[X] = E Xi j
i=1 j=i+1

n−1
n
= E[Xi j ]
i=1 j=i+1

by the linearity of expectations.

Since Xi j is an indicator random variable that takes on only the values 0 and 1,
E[Xi j ] is equal to the probability that Xi j is 1. Hence all we need to do is compute the
probability that two elements yi and y j are compared. Now, yi and y j are compared if
and only if either yi or y j is the first pivot selected by Random Quicksort from the set
Y i j = {yi , yi+1 , . . . , y j−1 , y j }. This is because if yi (or y j ) is the first pivot selected from
this set, then yi and y j must still be in the same sublist, and hence they will be compared.
Similarly, if neither is the first pivot from this set, then yi and y j will be separated into
distinct sublists and so will not be compared.
Since our pivots are chosen independently and uniformly at random from each sub-
list, it follows that, the first time a pivot is chosen from Y i j , it is equally likely to be
any element from this set. Thus the probability that yi or y j is the first pivot selected
from Y i j , which is the probability that Xi j = 1, is 2/( j − i + 1). Using the substitution
39
discrete random variables and expectation

k = j − i + 1 then yields

n−1
n
2
E[X] =
i=1 j=i+1
j−i+1

n−1 n−i+1
2
=
i=1 k=2
k
2
n n+1−k
=
k=2 i=1
k
n
2
= (n + 1 − k)
k=2
k
2
n
= (n + 1) − 2(n − 1)
k=2
k

n
1
= (2n + 2) − 4n.
k=1
k

Notice that we used a rearrangement of the double summation to obtain a clean form
for the expectation.
Recalling that the summation H(n) = nk=1 1/k satisfies H(n) = ln n + (1), we
have E[X] = 2n ln n + (n).
Next we consider the deterministic version of Quicksort, on random input. We assume
that the order of the elements in each recursively constructed sublist is the same as in
the initial list.
Theorem 2.12: Suppose that, whenever a pivot is chosen for Quicksort, the first ele-
ment of the sublist is chosen. If the input is chosen uniformly at random from all pos-
sible permutations of the values, then the expected number of comparisons made by
Deterministic Quicksort is 2n ln n + O(n).
Proof: The proof is essentially the same as for Random Quicksort. Again, yi and y j
are compared if and only if either yi or y j is the first pivot selected by Quicksort from
the set Y i j . Since the order of elements in each sublist is the same as in the original list,
the first pivot selected from the set Y i j is just the first element from Y i j in the input list,
and since all possible permutations of the input values are equally likely, every element
in Y i j is equally likely to be first. From this, we can again use linearity of expectations
in the same way as in the analysis of Random Quicksort to obtain the same expression
for E[X].

2.6. Exercises

Exercise 2.1: Suppose we roll a fair k-sided die with the numbers 1 through k on the
die’s faces. If X is the number that appears, what is E[X]?
40
2.6 exercises

Exercise 2.2: A monkey types on a 26-letter keyboard that has lowercase letters only.
Each letter is chosen independently and uniformly at random from the alphabet. If the
monkey types 1,000,000 letters, what is the expected number of times the sequence
“proof” appears?

Exercise 2.3: Give examples of functions f and random variables X where E[ f (X )] <
f (E[X]), E[ f (X )] = f (E[X]), and E[ f (X )] > f (E[X]).

Exercise 2.4: Prove that E[X k ] ≥ E[X]k for any even integer k ≥ 1.

Exercise 2.5: If X is a B(n, 1/2) random variable with n ≥ 1, show that the probability
that X is even is 1/2.

Exercise 2.6: Suppose that we independently roll two standard six-sided dice. Let X1
be the number that shows on the first die, X2 the number on the second die, and X the
sum of the numbers on the two dice.

(a) What is E[X | X1 is even]?

(b) What is E[X | X1 = X2 ]?
(c) What is E[X1 | X = 9]?
(d) What is E[X1 − X2 | X = k] for k in the range [2, 12]?

Exercise 2.7: Let X and Y be independent geometric random variables, where X has
parameter p and Y has parameter q.

(a) What is the probability that X = Y ?

(b) What is E[max(X, Y )]?
(c) What is Pr(min(X, Y ) = k)?
(d) What is E[X | X ≤ Y ]?

You may find it helpful to keep in mind the memoryless property of geometric random
variables.

Exercise 2.8: (a) Alice and Bob decide to have children until either they have their first
girl or they have k ≥ 1 children. Assume that each child is a boy or girl independently
with probability 1/2 and that there are no multiple births. What is the expected number
of female children that they have? What is the expected number of male children that
they have?
(b) Suppose Alice and Bob simply decide to keep having children until they have
their first girl. Assuming that this is possible, what is the expected number of boys that
they have?

Exercise 2.9: (a) Suppose that we roll twice a fair k-sided die with the numbers 1
through k on the die’s faces, obtaining values X1 and X2 . What is E[max(X1 , X2 )]?
What is E[min(X1 , X2 )]?
41
discrete random variables and expectation

(b) Show from your calculations in part (a) that

E[max(X1 , X2 )] + E[min(X1 , X2 )] = E[X1 ] + E[X2 ]. (2.1)
(c) Explain why Eqn. (2.1) must be true by using the linearity of expectations instead
of a direct computation.

Exercise 2.10: (a) Show by induction that if f : R → R is convex then, for any
x1 , x2 , . . . , xn and λ1 , λ2 , . . . , λn with ni=1 λi = 1,
n n
f λi xi ≤ λi f (xi ). (2.2)
i=1 i=1

(b) Use Eqn. (2.2) to prove that if f : R → R is convex then

E[ f (X )] ≥ f (E[X])
for any random variable X that takes on only finitely many values.

Exercise 2.11: Prove Lemma 2.6.

Exercise 2.12: We draw cards uniformly at random with replacement from a deck of
n cards. What is the expected number of cards we must draw until we have seen all n
cards in the deck? If we draw 2n cards, what is the expected number of cards in the
deck that are not chosen at all? Chosen exactly once?

Exercise 2.13: (a) Consider the following variation of the coupon collector’s problem.
Each box of cereal contains one of 2n different coupons. The coupons are organized
into n pairs, so that coupons 1 and 2 are a pair, coupons 3 and 4 are a pair, and so on.
Once you obtain one coupon from every pair, you can obtain a prize. Assuming that
the coupon in each box is chosen independently and uniformly at random from the 2n
possibilities, what is the expected number of boxes you must buy before you can claim
the prize?
(b) Generalize the result of the problem in part (a) for the case where there are kn
different coupons, organized into n disjoint sets of k coupons, so that you need one
coupon from every set.

Exercise 2.14: The geometric distribution arises as the distribution of the number of
times we flip a coin until it comes up heads. Consider now the distribution of the number
of flips X until the kth head appears, where each coin flip comes up heads independently
with probability p. Prove that this distribution is given by

n−1
Pr(X = n) = pk (1 − p)n−k
k−1
for n ≥ k. (This is known as the negative binomial distribution.)

Exercise 2.15: For a coin that comes up heads independently with probability p on
each flip, what is the expected number of flips until the kth heads?
42
2.6 exercises

Exercise 2.16: Suppose we flip a coin n times to obtain a sequence of flips X1 ,

X2 , . . . , Xn . A streak of flips is a consecutive subsequence of flips that are all the same.
For example, if X3 , X4 , and X5 are all heads, there is a streak of length 3 starting at the
third flip. (If X6 is also heads, then there is also a streak of length 4 starting at the third
flip.)

(a) Let n be a power of 2. Show that the expected number of streaks of length log2 n + 1
is 1 − o(1).
(b) Show that, for sufficiently large n, the probability that there is no streak of length
at least log2 n − 2 log2 log2 n is less than 1/n. (Hint: Break the sequence of flips
up into disjoint blocks of log2 n − 2 log2 log2 n consecutive flips, and use that the
event that one block is a streak is independent of the event that any other block is
a streak.)

Exercise 2.17: Recall the recursive spawning process described in Section 2.3. Sup-
pose that each call to process S recursively spawns new copies of the process S, where
the number of new copies is 2 with probability p and 0 with probability 1 − p. If Yi
denotes the number of copies of S in the ith generation, determine E[Yi ]. For what
values of p is the expected total number of copies bounded?

Exercise 2.18: The following approach is often called reservoir sampling. Suppose
we have a sequence of items passing by one at a time. We want to maintain a sample
of one item with the property that it is uniformly distributed over all the items that we
have seen at each step. Moreover, we want to accomplish this without knowing the total
number of items in advance or storing all of the items that we see.
Consider the following algorithm, which stores just one item in memory at all times.
When the first item appears, it is stored in the memory. When the kth item appears, it
replaces the item in memory with probability 1/k. Explain why this algorithm solves
the problem.

Exercise 2.19: Suppose that we modify the reservoir sampling algorithm of Exer-
cise 2.18 so that, when the kth item appears, it replaces the item in memory with prob-
ability 1/2. Describe the distribution of the item in memory.

Exercise 2.20: A permutation on the numbers [1, n] can be represented as a function

π : [1, n] → [1, n], where π (i) is the position of i in the ordering given by the permu-
tation. A fixed point of a permutation π : [1, n] → [1, n] is a value for which π (x) = x.
Find the expected number of fixed points of a permutation chosen uniformly at random
from all permutations.

Exercise 2.21: Let a1 , a2 , . . . , an be a random permutation of {1, 2, . . . , n}, equally

likely to be any of the n! possible permutations. When sorting the list a1 , a2 , . . . , an ,
the element ai must move a distance of |ai − i| places from its current position to reach
43
discrete random variables and expectation

its position in the sorted order. Find

n
E |ai − i| ,
i=1

the expected total distance that elements will have to be moved.

Exercise 2.22: Let a1 , a2 , . . . , an be a list of n distinct numbers. We say that ai and

a j are inverted if i < j but ai > a j . The Bubblesort sorting algorithm swaps pairwise
adjacent inverted numbers in the list until there are no more inversions, so the list is
in sorted order. Suppose that the input to Bubblesort is a random permutation, equally
likely to be any of the n! permutations of n distinct numbers. Determine the expected
number of inversions that need to be corrected by Bubblesort.

Exercise 2.23: Linear insertion sort can sort an array of numbers in place. The first
and second numbers are compared; if they are out of order, they are swapped so that
they are in sorted order. The third number is then placed in the appropriate place in the
sorted order. It is first compared with the second; if it is not in the proper order, it is
swapped and compared with the first. Iteratively, the kth number is handled by swap-
ping it downward until the first k numbers are in sorted order. Determine the expected
number of swaps that need to be made with a linear insertion sort when the input is a
random permutation of n distinct numbers.

Exercise 2.24: We roll a standard fair die over and over. What is the expected number
of rolls until the first pair of consecutive sixes appears? (Hint: The answer is not 36.)

Exercise 2.25: A blood test is being performed on n individuals. Each person can
be tested separately, but this is expensive. Pooling can decrease the cost. The blood
samples of k people can be pooled and analyzed together. If the test is negative, this
one test suffices for the group of k individuals. If the test is positive, then each of the k
persons must be tested separately and thus k + 1 total tests are required for the k people.
Suppose that we create n/k disjoint groups of k people (where k divides n) and use the
pooling method. Assume that each person has a positive result on the test independently
with probability p.

(a) What is the probability that the test for a pooled sample of k people will be positive?
(b) What is the expected number of tests necessary?
(c) Describe how to find the best value of k.
(d) Give an inequality that shows for what values of p pooling is better than just testing
every individual.

Exercise 2.26: A permutation π : [1, n] → [1, n] can be represented as a set of cycles

as follows. Let there be one vertex for each number i, i = 1, . . . , n. If the permutation
maps the number i to the number π (i), then a directed arc is drawn from vertex i to
vertex π (i). This leads to a graph that is a set of disjoint cycles. Notice that some of
44
2.6 exercises

the cycles could be self-loops. What is the expected number of cycles in a random
permutation of n numbers?

Exercise 2.27: Consider the following distribution on

∞the integers x ≥ 1: Pr(X = x)
= (6/π 2 )x−2 . This is a valid distribution, since k=1 k
−2
= π 2 /6. What is its
expectation?

Exercise 2.28: Consider a simplified version of roulette in which you wager x dollars
on either red or black. The wheel is spun, and you receive your original wager plus
another x dollars if the ball lands on your color; if the ball doesn’t land on your color,
you lose your wager. Each color occurs independently with probability 1/2. (This is a
simplification because real roulette wheels have one or two spaces that are neither red
nor black, so the probability of guessing the correct color is actually less than 1/2.)
The following gambling strategy is a popular one. On the first spin, bet 1 dollar. If
you lose, bet 2 dollars on the next spin. In general, if you have lost on the first k − 1
spins, bet 2k−1 dollars on the kth spin. Argue that by following this strategy you will
eventually win a dollar. Now let X be the random variable that measures your maximum
loss before winning (i.e., the amount of money you have lost before the play on which
you win). Show that E[X] is unbounded. What does it imply about the practicality of
this strategy?

Exercise 2.29: Prove that, if X0 , X1 , . . . is a sequence of random variables such that

∞

E[|X j |]
j=0

converges, then the linearity of expectations holds:

∞ ∞
E Xj = E[X j ].
j=0 j=0

Exercise 2.30: In the roulette problem of Exercise 2.28, we found that with probability
1 you eventually win a dollar. Let X j be the amount you win on the jth bet. (This
might be 0 if you have already won a previous bet.) Determine E[X j ] and show that,
by applying the linearity of expectations, you find your expected winnings are 0. Does
the linearity of expectations hold in this case? (Compare with Exercise 2.29.)

Exercise 2.31: A variation on the roulette problem of Exercise 2.28 is the following.
We repeatedly flip a fair coin. You pay j dollars to play the game. If the first head comes
up on the kth flip, you win 2k /k dollars. What are your expected winnings? How much
would you be willing to pay to play the game?

Exercise 2.32: You need a new staff assistant, and you have n people to interview. You
want to hire the best candidate for the position. When you interview a candidate, you
can give them a score, with the highest score being the best and no ties being possible.
45
discrete random variables and expectation

You interview the candidates one by one. Because of your company’s hiring practices,
after you interview the kth candidate, you either offer the candidate the job before the
next interview or you forever lose the chance to hire that candidate. We suppose the
candidates are interviewed in a random order, chosen uniformly at random from all n!
possible orderings.
We consider the following strategy. First, interview m candidates but reject them all;
these candidates give you an idea of how strong the field is. After the mth candidate,
hire the first candidate you interview who is better than all of the previous candidates
you have interviewed.
(a) Let E be the event that we hire the best assistant, and let Ei be the event that ith
candidate is the best and we hire him. Determine Pr(Ei ), and show that
m
n
1
Pr(E ) = .
n j=m+1 j − 1
n 1
(b) Bound j=m+1 j−1 to obtain
m m
(ln n − ln m) ≤ Pr(E ) ≤ (ln(n − 1) − ln(m − 1)).
n n
(c) Show that m(ln n − ln m)/n is maximized when m = n/e, and explain why this
means Pr(E ) ≥ 1/e for this choice of m.

46
chapter three
Moments and Deviations

In this and the next chapter we examine techniques for bounding the tail distribution,
the probability that a random variable assumes values that are far from its expectation.
In the context of analysis of algorithms, these bounds are the major tool for estimating
the failure probability of algorithms and for establishing high probability bounds on
their run-time. In this chapter we study Markov’s and Chebyshev’s inequalities and
demonstrate their application in an analysis of a randomized median algorithm. The
next chapter is devoted to the Chernoff bound and its applications.

3.1. Markov’s Inequality

Markov’s inequality, formulated in the next theorem, is often too weak to yield useful
results, but it is a fundamental tool in developing more sophisticated bounds.
Theorem 3.1 [Markov’s Inequality]: Let X be a random variable that assumes only
nonnegative values. Then, for all a > 0,
E[X]
Pr(X ≥ a) ≤ .
a
Proof: For a > 0, let

1 if X ≥ a,
I=
0 otherwise,
and note that, since X ≥ 0,
X
I≤ . (3.1)
a
Because I is a 0–1 random variable, E[I] = Pr(I = 1) = Pr(X ≥ a).
Taking expectations in (3.1) thus yields

X E[X]
Pr(X ≥ a) = E[I] ≤ E = .
a a

47
moments and deviations

For example, suppose we use Markov’s inequality to bound the probability of obtaining
more than 3n/4 heads in a sequence of n fair coin flips. Let

1 if the ith coin flip is heads,
Xi =
0 otherwise,
n
and let X = i=1 Xi denote the number of heads in the n coin flips. Since E[Xi ] =
Pr(Xi = 1) = 1/2, it follows that E[X] = ni=1 E[Xi ] = n/2. Applying Markov’s
inequality, we obtain
E[X] n/2 2
Pr(X ≥ 3n/4) ≤ = = .
3n/4 3n/4 3

3.2. Variance and Moments of a Random Variable

Markov’s inequality gives the best tail bound possible when all we know is the expec-
tation of the random variable and that the variable is nonnegative (see Exercise 3.16). It
can be improved upon if more information about the distribution of the random variable
is available.
Additional information about a random variable is often expressed in terms of its
moments. The expectation is also called the first moment of a random variable. More
generally, we define the moments of a random variable as follows.
Definition 3.1: The kth moment of a random variable X is E[X k ].
A significantly stronger tail bound is obtained when the second moment (E[X 2 ]) is also
available. Given the first and second moments, one can compute the variance and stan-
dard deviation of the random variable. Intuitively, the variance and standard deviation
offer a measure of how far the random variable is likely to be from its expectation.
Definition 3.2: The variance of a random variable X is defined as
Var[X] = E[(X − E[X])2 ] = E[X 2 ] − (E[X])2 .
The standard deviation of a random variable X is

σ [X] = Var[X].
The two forms of the variance in the definition are equivalent, as is easily seen by using
the linearity of expectations. Keeping in mind that E[X] is a constant, we have
E[(X − E[X])2 ] = E[X 2 − 2XE[X] + E[X]2 ]
= E[X 2 ] − 2E[XE[X]] + E[X]2
= E[X 2 ] − 2E[X]E[X] + E[X]2
= E[X 2 ] − (E[X])2 .
If a random variable X is constant – so that it always assumes the same value – then
its variance and standard deviation are both zero. More generally, if a random vari-
able X takes on the value kE[X] with probability 1/k and the value 0 with probability
48
3.2 variance and moments of a random variable

√
1 − 1/k, then its variance is (k − 1)(E[X])2 and its standard deviation is k − 1E[X].
These cases help demonstrate the intuition that the variance (and standard deviation)
of a random variable are small when the random variable assumes values close to its
expectation and are large when it assumes values far from its expectation.
We have previously seen that the expectation of the sum of two random variables is
equal to the sum of their individual expectations. It is natural to ask whether the same
is true for the variance. We find that the variance of the sum of two random variable
has an extra term, called the covariance.
Definition 3.3: The covariance of two random variables X and Y is
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])].
Theorem 3.2: For any two random variables X and Y,
Var[X + Y ] = Var[X] + Var[Y ] + 2 Cov(X, Y ).
Proof:
Var[X + Y ] = E[(X + Y − E[X + Y ])2 ]
= E[(X + Y − E[X] − E[Y ])2 ]
= E[(X − E[X])2 + (Y − E[Y ])2 + 2(X − E[X])(Y − E[Y ])]
= E[(X − E[X])2 ] + E[(Y − E[Y ])2 ] + 2E[(X − E[X])(Y − E[Y ])]
= Var[X] + Var[Y ] + 2 Cov(X, Y ).
The extension of this theorem to a sum of any finite number of random variables is
proven in Exercise 3.14.
The variance of the sum of two (or any finite number of) random variables does equal
the sum of the variances when the random variables are independent. Equivalently, if X
and Y are independent random variables, then their covariance is equal to zero. To prove
this result, we first need a result about the expectation of the product of independent
random variables.
Theorem 3.3: If X and Y are two independent random variables, then
E[X · Y ] = E[X] · E[Y ].
Proof: In the summations that follow, let i take on all values in the range of X, and let
j take on all values in the range of Y:

E[X · Y ] = (i · j) · Pr((X = i) ∩ (Y = j))
i j

= (i · j) · Pr(X = i) · Pr(Y = j)
i j

= i · Pr(X = i) j · Pr(Y = j)
i j
= E[X] · E[Y ],
where the independence of X and Y is used in the second line.

49
moments and deviations

Unlike the linearity of expectations, which holds for the sum of random variables
whether they are independent or not, the result that the expectation of the product of
two (or more) random variables is equal to the product of their expectations does not
necessarily hold if the random variables are dependent. To see this, let Y and Z each
correspond to fair coin flips, with Y and Z taking on the value 0 if the flip is heads
and 1 if the flip is tails. Then E[Y ] = E[Z] = 1/2. If the two flips are independent,
then Y · Z is 1 with probability 1/4 and 0 otherwise, so indeed E[Y · Z] = E[Y ] · E[Z].
Suppose instead that the coin flips are dependent in the following way: the coins are
tied together, so Y and Z either both come up heads or both come up tails together. Each
coin considered individually is still a fair coin flip, but now Y · Z is 1 with probability
1/2 and so E[Y · Z] = E[Y ] · E[Z].

Corollary 3.4: If X and Y are independent random variables, then

Cov(X, Y ) = 0

and

Var[X + Y ] = Var[X] + Var[Y ].

Proof:

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]

= E[X − E[X]] · E[Y − E[Y ]]
= 0.

In the second equation we have used the fact that, since X and Y are independent, so
are X − E[X] and Y − E[Y ] and hence Theorem 3.3 applies. For the last equation we
use the fact that, for any random variable Z,

E[(Z − E[Z])] = E[Z] − E[E[Z]] = 0.

Since Cov(X, Y ) = 0, we have Var[X + Y ] = Var[X] + Var[Y ].

By induction we can extend the result of Corollary 3.4 to show that the variance of
the sum of any finite number of independent random variables equals the sum of their
variances.

Theorem 3.5: Let X1 , X2 , . . . , Xn be mutually independent random variables. Then

n n
Var Xi = Var[Xi ].
i=1 i=1

50
3.3 chebyshev’s inequality

3.2.1. Example: Variance of a Binomial Random Variable

The variance of a binomial random variable X with parameters n and p can be deter-
mined directly by computing E[X 2 ]:
n
n
E[X ] =
2
p j (1 − p)n− j j2
j=0
j

n
n!
= p j (1 − p)n− j (( j2 − j) + j)
j=0
(n − j)! j!
n
n! ( j2 − j) j n
n! j
= p (1 − p)n− j + p j (1 − p)n− j
j=0
(n − j)! j! j=0
(n − j)! j!

n
(n − 2)!
= n(n − 1)p 2
p j−2 (1 − p)n− j
j=2
(n − j)! ( j − 2)!

n
(n − 1)!
+ np p j−1 (1 − p)n− j
j=1
(n − j)! ( j − 1)!

= n(n − 1)p2 + np.

Here we have simplified the summations by using the binomial theorem. We conclude
that
Var[X] = E[X 2 ] − (E[X])2
= n(n − 1)p2 + np − n2 p2
= np − np2
= np(1 − p).
An alternative derivation makes use of independence. Recall from Section 2.2 that a
binomial random variable X can be represented as the sum of n independent Bernoulli
trials, each with success probability p. Such a Bernoulli trial Y has variance
E[(Y − E[Y ])2 ] = p(1 − p)2 + (1 − p)(−p)2 = p − p2 = p(1 − p).
By Theorem 3.5, the variance of X is then np(1 − p).

3.3. Chebyshev’s Inequality

Using the expectation and the variance of the random variable, one can derive a signif-
icantly stronger tail bound known as Chebyshev’s inequality.
Theorem 3.6 [Chebyshev’s Inequality]: For any a > 0,
Var[X]
Pr(|X − E[X]| ≥ a) ≤ .
a2

51
moments and deviations

Proof: We first observe that

Pr(|X − E[X]| ≥ a) = Pr((X − E[X])2 ≥ a2 ).
Since (X − E[X])2 is a nonnegative random variable, we can apply Markov’s inequality
to prove:
E[(X − E[X])2 ] Var[X]
Pr((X − E[X])2 ≥ a2 ) ≤ = .
a2 a2
The following useful variants of Chebyshev’s inequality bound the deviation of the ran-
dom variable from its expectation in terms of a constant factor of its standard deviation
or expectation.
Corollary 3.7: For any t > 1,
1
Pr(|X − E[X]| ≥ t · σ [X]) ≤ and
t2
Var[X]
Pr(|X − E[X]| ≥ t · E[X]) ≤ 2 .
t (E[X])2
Let us again consider our coin-flipping example, and this time use Chebyshev’s inequal-
ity to bound the probability of obtaining more than 3n/4 heads in a sequence of n
Recall that Xi = 1 if the ith coin flip is heads and 0 otherwise, and
fair coin flips.
that X = ni=1 Xi denotes the number of heads in the n coin flips. To use Cheby-
shev’s inequality we need to compute the variance of X. Observe first that, since Xi is a
0–1 random variable,
1
E[(Xi )2 ] = E[Xi ] = .
2
Thus,
1 1 1
Var[Xi ] = E[(Xi )2 ] − (E[Xi ])2 = − = .
2 4 4
n
Now, since X = i=1 Xi and the Xi are independent, we can use Theorem 3.5 to
compute
n n
n
Var[X] = Var Xi = Var[Xi ] = .
i=1 i=1
4

Applying Chebyshev’s inequality then yields

Pr(X ≥ 3n/4) ≤ Pr(|X − E[X]| ≥ n/4)
Var[X]
≤
(n/4)2
n/4
=
(n/4)2
4
= .
n

52
3.3 chebyshev’s inequality

In fact, we can do slightly better. Chebyshev’s inequality yields that 4/n is actu-
ally a bound on the probability that X is either smaller than n/4 or larger than 3n/4,
so by symmetry the probability that X is greater than 3n/4 is actually 2/n. Cheby-
shev’s inequality gives a significantly better bound than Markov’s inequality for large
n.

3.3.1. Example: Coupon Collector’s Problem

We apply Markov’s and Chebyshev’s inequalities to the coupon collector’s prob-
lem. Recall that the time X to collect n coupons has expectation nHn , where Hn =
n
i=1 1/n = ln n + O(1). Hence Markov’s inequality yields

1
Pr(X ≥ 2nHn ) ≤ .
2
To use Chebyshev’s inequality,
we need to find the variance of X. Recall again from
Section 2.4.1 that X = ni=1 Xi , where the Xi are geometric random variables with
parameter (n − i + 1)/n. In this case, the Xi are independent because the time to col-
lect the ith coupon does not depend on how long it took to collect the previous i − 1
coupons. Hence

n n
Var[X] = Var Xi = Var[Xi ],
i=1 i=1

so we need to find the variance of a geometric random variable.

Let Y be a geometric random variable with parameter p. As we saw in Section 2.4,
E[X] = 1/p. We calculate E[Y 2 ]. The following trick proves useful. We know that, for
0 < x < 1,

∞
1
= xi .
1−x i=0

Taking derivatives, we find:

∞
1
= ixi−1
(1 − x)2 i=0
∞

= (i + 1)xi ;
i=0
∞

2
= i(i − 1)xi−2
(1 − x)3 i=0
∞

= (i + 1)(i + 2)xi .
i=0

53
moments and deviations

We can conclude that

∞
∞

ix =
2 i
i2 xi
i=1 i=0
∞
∞
∞

= (i + 1)(i + 2)x − 3 i
(i + 1)x +
i
xi
i=0 i=0 i=0
2 1 1
= −3 +
(1 − x)3 (1 − x)2 (1 − x)
x2 + x
= .
(1 − x)3

We now use this to find

∞

E[Y 2 ] = p(1 − p)i−1 i2
i=1
∞
p
= (1 − p)i i2
1 − p i=1
p (1 − p)2 + (1 − p)
=
1− p p3
2− p
= .
p2

Finally, we reach

Var[Y ] = E[Y 2 ] − E[Y ]2

2− p 1
= 2
− 2
p p
1− p
= .
p2

We have just proven the following useful lemma.

Lemma 3.8: The variance of a geometric random variable with parameter p is

(1 − p)/p2 .

For a geometric random variable Y, E[Y 2 ] can also be derived using conditional expec-
tations. We use that Y corresponds to the number of flips until the first heads, where
each flip is heads with probability p. Let X = 0 if the first flip is tails and X = 1 if the
first flip is heads. By Lemma 2.5,

E[Y 2 ] = Pr(X = 0)E[Y 2 | X = 0] + Pr(X = 1)E[Y 2 | X = 1]

= (1 − p)E[Y 2 | X = 0] + pE[Y 2 | X = 1].

54
3.4 median and mean

If X = 1, then Y = 1 and so E[Y 2 | X = 1] = 1. If X = 0, then Y > 1. In this case,

let the number of remaining flips after the first flip until the first head be Z. Then
E[Y 2 ] = (1 − p)E[(Z + 1)2 ] + p · 1
= (1 − p)E[Z 2 ] + 2(1 − p)E[Z] + 1 (3.2)
by the linearity of expectations. By the memoryless property of geometric random vari-
ables, Z is also a geometric random variable with parameter p. Hence E[Z] = 1/p and
E[Z 2 ] = E[Y 2 ]. Plugging these values into Eqn. (3.2), we have
2(1 − p) 2− p
E[Y 2 ] = (1 − p)E[Y 2 ] + + 1 = (1 − p)E[Y 2 ] + ,
p p
which yields E[Y 2 ] = (2 − p)/p2 , matching our other derivation.
We return now to the question of the variance in the coupon collector’s problem.
We simplify the argument by using the upper bound Var[Y ] ≤ 1/p2 for a geometric
random variable, instead of the exact result of Lemma 3.8. Then
n n 2 n 2
n 1 π 2 n2
Var[X] = Var[Xi ] ≤ =n 2
≤ .
i=1 i=1
n−i+1 i=1
i 6
Here we have used the identity
∞ 2
1 π2
= .
i=1
i 6
Now, by Chebyshev’s inequality,

n2 π 2 /6 π2 1
Pr(|X − nHn | ≥ nHn ) ≤ = =O .
(nHn )2 6(Hn )2 ln2 n
In this case, Chebyshev’s inequality again gives a much better bound than Markov’s
inequality. But it is still a fairly weak bound, as we can see by considering instead a
fairly simple union bound argument.
Consider the probability of not obtaining the ith coupon after n ln n + cn steps. This
probability is

1 n(ln n+c) 1
1− < e−(ln n+c) = c .
n en
By a union bound, the probability that some coupon has not been collected after n ln n +
cn steps is only e−c . In particular, the probability that all coupons are not collected after
2n ln n steps is at most 1/n, a bound that is significantly better than what can be achieved
even with Chebyshev’s inequality.

3.4. Median and Mean

Let X be a random variable. The median of X is defined to be any value m such that
Pr(X ≤ m) ≥ 1/2 and Pr(X ≥ m) ≥ 1/2.
55
moments and deviations

For example, for a discrete random variable that is uniformly distributed over an odd
number of distinct, sorted values x1 , x2 , . . . , x2k+1 , the median is the middle value xk+1 .
For a discrete random variable that is uniformly distributed over an even number of
values x1 , x2 , . . . , x2k , any value in the range (xk , xk+1 ) would be a median.
The expectation E[X] and the median are usually different numbers. For distribu-
tions with a unique median that are symmetric around either the mean or median, the
median is equal to the mean. For some distributions, the median can be easier to work
with than the mean, and in some settings it is a more natural quantity to work with.
The following theorem gives an alternate characterization of the mean and median:

Theorem 3.9: For any random variable X with finite expectation E[X] and finite
median m,

1. the expectation E[X] is the value of c that minimizes the expression

E[(X − c)2 ],

and
2. the median m is a value of c that minimizes the expression

E[|X − c|].

Proof: The first result follows from linearity of expectations.

E[(X − c)2 ] = E[X 2 ] − 2cE[X] + c2 ,

and taking the derivative with respect to c shows that c = E[X] yields the minimum.
For the second result, we want to show that that for any value c that is not a
median and for any median m, we have E[|X − c|] > E[|X − m|], or equivalently that
E[|X − c| − |X − m|] > 0. In that case the value of c that minimizes E[|X − c|] will
be a median. (In fact, as a by-product, we show that for any two medians m and m ,
E[|X − m|] = E[|X − m |].)
Let us take the case where c > m for a median m, and c is not a median, so Pr(X ≥
c) < 1/2. A similar argument holds for any value of c such that Pr(X ≤ c) < 1/2.
For x ≥ c, |x − c| − |x − m| = m − c. For m < x < c, |x − c| − |x − m| = c + m −
2x > m − c. Finally, for x ≤ m, |x − c| − |x − m| = c − m. Combining the three cases,
we have

E[|X − c| − |X − m|]

= Pr(X ≥ c)(m − c) + Pr(X = x)(c + m − 2x) + Pr(X ≤ m)(c − m).
x:m<x<c

We now consider two cases. If Pr(m < X < c) = 0, then

E[|X − c| − |X − m|] = Pr(X ≥ c)(m − c) + Pr(X ≤ m)(c − m)

1 1
> (m − c) + (c − m)
2 2
= 0,
56
3.5 application: a randomized algorithm for computing the median

where the inequality comes from Pr(X ≥ c) < 1/2 and m < c. (Note here that if c were
another median, so Pr(X ≥ c) = 1/2, we would obtain E[|X − c| − |X − m|] = 0, as
stated earlier.)
If Pr(m < X < c) = 0, then
E[|X − c| − |X − m|]

= Pr(X ≥ c)(m − c) + Pr(X = x)(c + m − 2x) + Pr(X ≤ m)(c − m)
x:m<x<c
> Pr(X > m)(m − c) + Pr(X ≤ m)(c − m)
1 1
> (m − c) + (c − m)
2 2
= 0,
where here the first inequality comes from c + m − 2x > m − c for any value of x
with non-zero probability in the range m < x < c. (This case cannot hold if c and m
are both medians, as in this case we cannot have Pr(X ≥ m) = 1/2 and Pr(X ≥ c) =
1/2.)
Interestingly, for well-behaved random variables, the median and the mean cannot
deviate from each other too much.
Theorem 3.10: If X is a random variable with finite standard deviation σ , expectation
μ, and median m, then
|μ − m| ≤ σ.
Proof: The proof follows from the following sequence:
|μ − m| = |E[X] − m|
= |E[X − m]|
≤ E[|X − m|]
≤ E[|X − μ|]

≤ E[(X − μ)2 ]
= σ.
Here the first inequality follows from Jensen’s inequality, the second inequality follows
from the result that the median minimizes E[|X − c|], and the third inequality is again
Jensen’s inequality.
In Exercise 3.19, we suggest another way of proving this result.

3.5. Application: A Randomized Algorithm for Computing the

Median

Given a set S of n elements drawn from a totally ordered universe, the median of S is
an element m of S such that at least n/2 elements in S are less than or equal to m and
at least n/2 + 1 elements in S are greater than or equal to m. If the elements in S are
57
moments and deviations

distinct, then m is the (n/2)th element in the sorted order of S. Note that the median
of a set is similar to but slightly different from the the median of a random variable
defined in Section 3.4.
The median can be easily found deterministically in O(n log n) steps by sorting,
and there is a relatively complex deterministic algorithm that computes the median in
O(n) time. Here we analyze a randomized linear time algorithm that is significantly
simpler than the deterministic one and yields a smaller constant factor in the linear
running time. To simplify the presentation, we assume that n is odd and that the ele-
ments in the input set S are distinct. The algorithm and analysis can be easily modified
to include the case of a multi-set S (see Exercise 3.24) and a set with an even number of
elements.

3.5.1. The Algorithm

The main idea of the algorithm involves sampling, which we first discussed in Sec-
tion 1.2. The goal is to find two elements that are close together in the sorted order of S
and that have the median lie between them. Specifically, we seek two elements d, u ∈ S
such that:
1. d ≤ m ≤ u (the median m is between d and u); and
2. for C = {s ∈ S : d ≤ s ≤ u}, |C| = o(n/ log n) (the total number of elements
between d and u is small).
Sampling gives us a simple and efficient method for finding two such elements.
We claim that, once these two elements are identified, the median can easily be
found in linear time with the following steps. The algorithm counts (in linear time) the
number d of elements of S that are smaller than d and then sorts (in sublinear, or o(n),
time) the set C. Notice that, since |C| = o(n/ log n), the set C can be sorted in time o(n)
using any standard sorting algorithm that requires O(m log m) time to sort m elements.
The ( n/2 − d + 1)th element in the sorted order of C is m, since there are exactly
n/2 elements in S that are smaller than that value ( n/2 − d in the set C and d in
S − C).
To find the elements d and u, we sample with replacement a multi-set R of n3/4
elements from S. Recall that sampling with replacement means each element in R is
chosen uniformly at random from the set S, independent of previous choices. Thus, the
same element of S might appear more than once in the multi-set R. Sampling without
replacement might give marginally better bounds, but both implementing and analyzing
it are significantly harder. It is worth noting that we assume that an element can be
sampled from S in constant time.
Since R is a random sample of S we expect m, the median element of S, to be close to
the median element of R. We therefore choose d and u to be elements of R surrounding
the median of R.
We require all the steps of our algorithm to work with high probability, by which we
mean with probability at least 1 − O(1/nc ) for some constant c > 0. To guarantee that
with high probability the set C includes the median m, we fix d and u to be respectively
√ √
the n3/4 /2 − nth and the n3/4 /2 + nth elements in the sorted order of R. With
58
3.5 application: a randomized algorithm for computing the median

Randomized Median Algorithm:

Input: A set S of n elements over a totally ordered universe.

Output: The median element of S, denoted by m.
1. Pick a (multi-)set R of n3/4 elements in S, chosen independently and uniformly
at random with replacement.
2. Sort the set R.
√
3. Let d be the 12 n3/4 − n th smallest element in the sorted set R.
√
4. Let u be the 12 n3/4 + n th smallest element in the sorted set R.
5. By comparing every element in S to d and u, compute the set
C = {x ∈ S : d ≤ x ≤ u} and the numbers d = |{x ∈ S : x < d}| and
u = |{x ∈ S : x > u}|.
6. If d > n/2 or u > n/2 then FAIL.
7. If |C| ≤ 4n3/4 then sort the set C, otherwise FAIL.
8. Output the ( n/2 − d + 1)th element in the sorted order of C.

Algorithm 3.1: Randomized median algorithm.

√
this choice, the set C includes all the elements of S that are between the 2 n sample
points surrounding the median of R. The analysis will clarify that the choice of the size
of R and the choices for d and u are tailored to guarantee both that (a) the set C is large
enough to include m with high probability and (b) the set C is sufficiently small so that
it can be sorted in sublinear time with high probability.
A formal description of the procedure is presented as Algorithm 3.1. In what follows,
√
for convenience we treat n and n3/4 as integers.

3.5.2. Analysis of the Algorithm

Based on our previous discussion, we first prove that – regardless of the random choices
made throughout the procedure – the algorithm (a) always terminates in linear time and
(b) outputs either the correct result or FAIL.
Theorem 3.11: The randomized median algorithm terminates in linear time, and if it
does not output FAIL then it outputs the correct median element of the input set S.
Proof: Correctness follows because the algorithm could only give an incorrect answer
if the median were not in the set C. But then either d > n/2 or u > n/2 and thus
step 6 of the algorithm guarantees that, in these cases, the algorithm outputs FAIL.
Similarly, as long as C is sufficiently small, the total work is only linear in the size of
S. Step 7 of the algorithm therefore guarantees that the algorithm does not take more
than linear time; if the sorting might take too long, the algorithm outputs FAIL without
sorting.

The interesting part of the analysis that remains after Theorem 3.11 is bounding the
probability that the algorithm outputs FAIL. We bound this probability by identifying
59
moments and deviations

three “bad” events such that, if none of these bad events occurs, the algorithm does not
fail. In a series of lemmas, we then bound the probability of each of these events and
show that the sum of these probabilities is only O(n−1/4 ).
Consider the following three events:
√
E 1 : Y1 = |{r ∈ R | r ≤ m}| < 12 n3/4 − n;
√
E2 : Y2 = |{r ∈ R | r ≥ m}| < 12 n3/4 − n;
E3 : |C| > 4n3/4 .

Lemma 3.12: The randomized median algorithm fails if and only if at least one of E1 ,
E2 , or E3 occurs.

Proof: Failure in step 7 of the algorithm is equivalent to the event E3 . Failure in step
6 of the algorithm occurs if and only if d > n/2 or u > n/2. But for d > n/2, the
√
1 3/4
2
n − n th smallest element of R must be larger than m; this is equivalent to the
event E1 . Similarly, u > n/2 is equivalent to the event E2 .

Lemma 3.13:
1 −1/4
Pr(E1 ) ≤ n .
4
Proof: Define a random variable Xi by

1 if the ith sample is less than or equal to the median,
Xi =
0 otherwise.

The Xi are independent, since the sampling is done with replacement. Because there are
(n − 1)/2 + 1 elements in S that are less than or equal to the median, the probability
that a randomly chosen element of S is less than or equal to the median can be written as

(n − 1)/2 + 1 1 1
Pr(Xi = 1) = = + .
n 2 2n
The event E1 is equivalent to

3/4
n
1 3/4 √
Y1 = Xi < n − n.
i=1
2

Since Y1 is the sum of Bernoulli trials, it is a binomial random variable with para-
meters n3/4 and 1/2 + 1/2n. Hence, using the result of Section 3.2.1 yields

3/4 1 1 1 1
Var[Y1 ] = n + −
2 2n 2 2n
1 1
= n3/4 − 5/4
4 4n
1 3/4
< n .
4
60
3.5 application: a randomized algorithm for computing the median

Applying Chebyshev’s inequality then yields

1 √
Pr(E1 ) = Pr Y1 < n3/4 − n
2
√
≤ Pr |Y1 − E[Y1 ]| > n
Var[Y1 ]
≤
n
1 3/4
n 1
< 4 = n−1/4 .
n 4
We similarly obtain the same bound for the probability of the event E2 . We now bound
the probability of the third bad event, E3 .
Lemma 3.14:
1 −1/4
Pr(E3 ) ≤ n .
2
Proof: If E3 occurs, so |C| > 4n3/4 , then at least one of the following two events occurs:
E3,1 : at least 2n3/4 elements of C are greater than the median;
E3,2 : at least 2n3/4 elements of C are smaller than the median.
Let us bound the probability that the first event occurs; the second will have the same
bound by symmetry. If there are at least 2n3/4 elements of C above the median, then
the order of u in the sorted order of S was at least 12 n + 2n3/4 and thus the set R has at
√
least 12 n3/4 − n samples among the 12 n − 2n3/4 largest elements in S.
Let X be the number of samples among the 12 n − 2n3/4 largest elements in S. Let
3/4
X = ni=1 Xi , where

1 if the ith sample is among the 12 n − 2n3/4 largest elements in S,
Xi =
0 otherwise.
Again, X is a binomial random variable, and we find
1 3/4 √
E[X] = n −2 n
2
and

1 1 1 1
Var[X] = n 3/4
− 2n−1/4 + 2n−1/4
= n3/4 − 4n1/4 < n3/4 .
2 2 4 4
Applying Chebyshev’s inequality yields
√
Pr(E3,1 ) = Pr X ≥ 12 n3/4 − n (3.3)
√ Var[X] 1 3/4
n 1 −1/4
≤ Pr |X − E[X]| ≥ n ≤ < 4
= n . (3.4)
n n 4
Similarly,
1 −1/4
Pr(E3,2 ) ≤ n
4
61
moments and deviations

and
1 −1/4
Pr(E3 ) ≤ Pr(E3,1 ) + Pr(E3,2 ) ≤ n .
2

Combining the bounds just derived, we conclude that the probability that the algorithm
outputs FAIL is bounded by
Pr(E1 ) + Pr(E2 ) + Pr(E3 ) ≤ n−1/4 .
This yields the following theorem.

Theorem 3.15: The probability that the randomized median algorithm fails is
bounded by n−1/4 .

By repeating Algorithm 3.1 until it succeeds in finding the median, we can obtain an
iterative algorithm that never fails but has a random running time. The samples taken
in successive runs of the algorithm are independent, so the success of each run is inde-
pendent of other runs, and hence the number of runs until success is achieved is a
geometric random variable. As an exercise, you may wish to show that this variation
of the algorithm (that runs until it finds a solution) still has linear expected running
time.
Randomized algorithms that may fail or return an incorrect answer are called Monte
Carlo algorithms. The running time of a Monte Carlo algorithm often does not depend
on the random choices made. For example, we showed in Theorem 3.11 that the ran-
domized median algorithm always terminates in linear time, regardless of its random
choices.
A randomized algorithm that always returns the right answer is called a Las Vegas
algorithm. We have seen that the Monte Carlo randomized algorithm for the median can
be turned into a Las Vegas algorithm by running it repeatedly until it succeeds. Again,
turning it into a Las Vegas algorithm means the running time is variable, although the
expected running time is still linear.

3.6. Exercises

Exercise 3.1: Let X be a number chosen uniformly at random from [1, n]. Find Var[X].

Exercise 3.2: Let X be a number chosen uniformly at random from [−k, k]. Find
Var[X].

Exercise 3.3: Suppose that we roll a standard fair die 100 times. Let X be the sum
of the numbers that appear over the 100 rolls. Use Chebyshev’s inequality to bound
Pr(|X − 350| ≥ 50).

Exercise 3.4: Prove that, for any real number c and any discrete random variable X,
Var[cX] = c2 Var[X].

62
3.6 exercises

Exercise 3.5: Given any two random variables X and Y, by the linearity of expecta-
tions we have E[X − Y ] = E[X] − E[Y ]. Prove that, when X and Y are independent,
Var[X − Y ] = Var[X] + Var[Y ].

Exercise 3.6: For a coin that comes up heads independently with probability p on each
flip, what is the variance in the number of flips until the kth head appears?

Exercise 3.7: A simple model of the stock market suggests that, each day, a stock with
price q will increase by a factor r > 1 to qr with probability p and will fall to q/r with
probability 1 − p. Assuming we start with a stock with price 1, find a formula for the
expected value and the variance of the price of the stock after d days.

Exercise 3.8: Suppose that we have an algorithm that takes as input a string of n
bits. We are told that the expected running time is O(n2 ) if the input bits are chosen
independently and uniformly at random. What can Markov’s inequality tell us about
the worst-case running time of this algorithm on inputs of size n?
n
Exercise 3.9: (a) Let X be the sum of Bernoulli random variables, X = i=1 Xi . The
Xi do not need to be independent. Show that

n
E[X 2 ] = Pr(Xi = 1)E[X | Xi = 1]. (3.5)
i=1

Hint: Start by showing that

n
E[X 2 ] = E[Xi X],
i=1

and then apply conditional expectations.

(b) Use Eqn. (3.5) to provide another derivation for the variance of a binomial ran-
dom variable with parameters n and p.

Exercise 3.10: For a geometric random variable X, find E[X 3 ] and E[X 4 ]. (Hint: Use
Lemma 2.5.)

Exercise 3.11: Recall the Bubblesort algorithm of Exercise 2.22. Determine the vari-
ance of the number of inversions that need to be corrected by Bubblesort.

Exercise 3.12: Find an example of a random variable with finite expectation and
unbounded variance. Give a clear argument showing that your choice has these
properties.

Exercise 3.13: Find an example of a random variable with finite jth moments for
1 ≤ j ≤ k but an unbounded (k + 1)th moment. Give a clear argument showing that
your choice has these properties.

63
moments and deviations

Exercise 3.14: Prove that, for any finite collection of random variables X1 , X2 , . . . , Xn ,
n n n
Var Xi = Var[Xi ] + 2 Cov(Xi , X j ).
i=1 i=1 i=1 j>i

n Let the random variable X be representable as a sum of random vari-

Exercise 3.15:
ables X = i=1 Xi . Show that,
if E[Xi X j ] = E[Xi ]E[X j ] for every pair of i and j with
1 ≤ i < j ≤ n, then Var[X] = ni=1 Var[Xi ].

Exercise 3.16: This problem shows that Markov’s inequality is as tight as it could
possibly be. Given a positive integer k, describe a random variable X that assumes only
nonnegative values such that
1
Pr(X ≥ kE[X]) = .
k
Exercise 3.17: Can you give an example (similar to that for Markov’s inequality in
Exercise 3.16) that shows that Chebyshev’s inequality is tight? If not, explain why not.

Exercise 3.18: Show that, for a random variable X with standard deviation σ [X] and
any positive real number t:
1
(a) Pr(X − E[X] ≥ tσ [X]) ≤ ;
1 + t2
2
(b) Pr(|X − E[X]| ≥ tσ [X]) ≤ .
1 + t2

Exercise 3.19: Using Exercise 3.18, show that |μ − m| ≤ σ for a random variable
with finite standard deviation σ , expectation μ, and median m.

Exercise 3.20: Let Y be a nonnegative integer-valued random variable with positive

expectation. Prove
E[Y ]2
≤ Pr[Y = 0] ≤ E[Y ].
E[Y 2 ]

Exercise 3.21: (a) Chebyshev’s inequality uses the variance of a random variable to
bound its deviation from its expectation. We can also use higher moments. Suppose
that we have a random variable X and an even integer k for which E[(X − E[X])k ] is
finite. Show that
1
Pr |X − E[X]| > t k E[(X − E[X])k ] ≤ k .
t
(b) Why is it difficult to derive a similar inequality when k is odd?

Exercise 3.22: A fixed point of a permutation π [1, n] → [1, n] is a value for which
π (x) = x. Find the variance in the number of fixed points of a permutation chosen
uni-
formly at random from all permutations. (Hint: Let Xi be 1 if π (i) = i, so that ni=1 Xi
64
3.6 exercises

n
is the number of fixed points. You cannot use linearity to find Var i=1 Xi , but you
can calculate it directly.)

Exercise 3.23: Suppose that we flip a fair coin n times to obtain n random bits. Con-
sider all m = n2 pairs ofthese bits in some order. Let Yi be the exclusive-or of the ith
pair of bits, and let Y = mi=1 Yi be the number of Yi that equal 1.

(a) Show that each Yi is 0 with probability 1/2 and 1 with probability 1/2.
(b) Show that the Yi are not mutually independent.
(c) Show that the Yi satisfy the property that E[YiY j ] = E[Yi ]E[Y j ].
(d) Using Exercise 3.15, find Var[Y ].
(e) Using Chebyshev’s inequality, prove a bound on Pr(|Y − E[Y ]| ≥ n).

Exercise 3.24: Generalize the median-finding algorithm for the case where the input
S is a multi-set. Bound the error probability and the running time of the resulting
algorithm.

Exercise 3.25: Generalize the median-finding algorithm to find the kth largest item in
a set of n items for any given value of k. Prove that your resulting algorithm is correct,
and bound its running time.

Exercise 3.26: The weak law of large numbers states that, if X1 , X2 , X3 , . . . are inde-
pendent and identically distributed random variables with mean μ and standard devia-
tion σ , then for any constant ε > 0 we have

X1 + X2 + · · · + Xn
lim Pr
− μ > ε = 0.
n→∞ n
Use Chebyshev’s inequality to prove the weak law of large numbers.

65
chapter four
Chernoff and Hoeffding Bounds

This chapter introduces large deviation bounds commonly called Chernoff and
Hoeffding bounds. These bounds are extremely powerful, giving exponentially
decreasing bounds on the tail distribution. These bounds are derived by applying
Markov’s inequality to the moment generating function of a random variable. We start
this chapter by defining and discussing the properties of the moment generating func-
tion. We then derive Chernoff bounds for the binomial distribution and other related
distributions, using a set balancing problem as an example, and the Hoeffding bound
for sums of bounded random variables. To demonstrate the power of Chernoff bounds,
we apply them to the analysis of randomized packet routing schemes on the hypercube
and butterfly networks.

4.1. Moment Generating Functions

Before developing Chernoff bounds, we discuss the special role of the moment gener-
ating function E[etX ].
Definition 4.1: The moment generating function of a random variable X is
MX (t ) = E[etX ].
We are mainly interested in the existence and properties of this function in the neigh-
borhood of zero.
The function MX (t ) captures all of the moments of X.
Theorem 4.1: Let X be a random variable with moment generating function MX (t ).
Under the assumption that exchanging the expectation and differentiation operands is
legitimate, for all n > 1 we then have
E[X n ] = MX(n) (0),

where MX(n) (0) is the nth derivative of MX (t ) evaluated at t = 0.

66
4.1 moment generating functions

Proof: Assuming that we can exchange the expectation and differentiation operands,
then
MX(n) (t ) = E[X n etX ].
Computed at t = 0, this expression yields
MX(n) (0) = E[X n ].
The assumption that expectation and differentiation operands can be exchanged holds
whenever the moment generating function exists in a neighborhood of zero, which will
be the case for all distributions considered in this book.
As a specific example, consider a geometric random variable X with parameter p, as
in Definition 2.8. Then, for t < − ln(1 − p),
MX (t ) = E[etX ]
∞
= (1 − p)k−1 petk
k=1
∞
p
= (1 − p)k etk
1 − p k=1
p
= ((1 − (1 − p)et )−1 − 1).
1− p
It follows that
MX(1) (t ) = p(1 − (1 − p)et )−2 et and
MX(2) (t ) = 2p(1 − p)(1 − (1 − p)et )−3 e2t + p(1 − (1 − p)et )−2 et .
Evaluating these derivatives at t = 0 and using Theorem 4.1 gives E[X] = 1/p
and E[X 2 ] = (2 − p)/p2 , matching our previous calculations from Section 2.4 and
Section 3.3.1.
Another useful property is that the moment generating function of a random variable
(or, equivalently, all of the moments of the variable) uniquely defines its distribution.
However, the proof of the following theorem is beyond the scope of this book.
Theorem 4.2: Let X and Y be two random variables. If
MX (t ) = MY (t )
for all t ∈ (−δ, δ) for some δ > 0, then X and Y have the same distribution.
One application of Theorem 4.2 is in determining the distribution of a sum of indepen-
dent random variables.
Theorem 4.3: If X and Y are independent random variables, then
MX+Y (t ) = MX (t )MY (t ).
Proof:
MX+Y (t ) = E[et(X+Y ) ] = E[etX etY ] = E[etX ]E[etY ] = MX (t )MY (t ).
67
chernoff and hoeffding bounds

Here we have used that X and Y are independent – and hence etX and etY are indepen-
dent – to conclude that E[etX etY ] = E[etX ]E[etY ].

Thus, if we know MX (t ) and MY (t ) and if we recognize the function MX (t )MY (t ) as

the moment generating function of a known distribution, then that must be the distribu-
tion of X + Y when Theorem 4.2 applies. We will see examples of this in subsequent
sections and in the exercises.

4.2. Deriving and Applying Chernoff Bounds

The Chernoff bound for a random variable X is obtained by applying Markov’s inequal-
ity to etX for some well-chosen value t. From Markov’s inequality, we can derive the
following useful inequality: for any t > 0,
E[etX ]
Pr(X ≥ a) = Pr(etX ≥ eta ) ≤ .
eta
In particular,
E[etX ]
Pr(X ≥ a) ≤ min .
t>0 eta
Similarly, for any t < 0,
E[etX ]
Pr(X ≤ a) = Pr(etX ≥ eta ) ≤ .
eta
Hence
E[etX ]
Pr(X ≤ a) ≤ min .
t<0 eta
Bounds for specific distributions are obtained by choosing appropriate values for t.
While the value of t that minimizes E[etX ]/eta gives the best possible bounds, often one
chooses a value of t that gives a convenient form. Bounds derived from this approach
are generally referred to collectively as Chernoff bounds. When we speak of a Chernoff
bound for a random variable, it could actually be one of many bounds derived in this
fashion.

4.2.1. Chernoff Bounds for the Sum of Poisson Trials

We now develop the most commonly used version of the Chernoff bound: for the tail
distribution of a sum of independent 0–1 random variables, which are also known as
Poisson trials. (Poisson trials differ from Poisson random variables, which will be dis-
cussed in Section 5.3.) The distributions of the random variables in Poisson trials are
not necessarily identical. Bernoulli trials are a special case of Poisson trials where the
independent 0–1 random variables have the same distribution; in other words, all trials
are Poisson trials that take on the value 1 with the same probability. Also recall that
the binomial distribution gives the number of successes in n independent Bernoulli
68
4.2 deriving and applying chernoff bounds

trials. Our Chernoff bound will hold for the binomial distribution and also for the more
general setting of the sum of Poisson trials.
, . . . , Xn be a sequence of independent Poisson trials with Pr(Xi = 1) = pi .
Let X1
Let X = ni=1 Xi , and let
n
n n
μ = E[X] = E Xi = E[Xi ] = pi .
i=1 i=1 i=1

For a given δ > 0, we are interested in bounds on Pr(X ≥ (1 + δ)μ) and Pr(X ≤
(1 − δ)μ) – that is, the probability that X deviates from its expectation μ by δμ or more.
To develop a Chernoff bound we need to compute the moment generating function of
X. We start with the moment generating function of each Xi :
MXi (t ) = E[etXi ]
= pi et + (1 − pi )
= 1 + pi (et − 1)
≤ e pi (e −1) ,
t

where in the last inequality we have used the fact that, for any y, 1 + y ≤ ey . Applying
Theorem 4.3, we take the product of the n generating functions to obtain

n
MX (t ) = MXi (t )
i=1

n
e pi (e −1)
t
≤
i=1
!

n
= exp pi (e − 1)
t

i=1
(et −1)μ
=e .
Now that we have determined a bound on the moment generating function, we are
ready to develop concrete versions of the Chernoff bound for a sum of Poisson trials.
We start with bounds on the deviation above the mean.
Theorem4.4: Let X1 , . . . , Xn be independent Poisson trials such that Pr(Xi = 1) = pi .
Let X = ni=1 Xi and μ = E[X]. Then the following Chernoff bounds hold:
1. for any δ > 0,
μ
eδ
Pr(X ≥ (1 + δ)μ) ≤ ; (4.1)
(1 + δ)(1+δ)
2. for 0 < δ ≤ 1,
Pr(X ≥ (1 + δ)μ) ≤ e−μδ /3
2
; (4.2)
3. for R ≥ 6μ,
Pr(X ≥ R) ≤ 2−R . (4.3)
69
chernoff and hoeffding bounds

The first bound of the theorem is the strongest, and it is from this bound that we derive
the other two bounds, which have the advantage of being easier to state and compute
with in many situations.
Proof: Applying Markov’s inequality, for any t > 0 we have
Pr(X ≥ (1 + δ)μ) = Pr(etX ≥ et(1+δ)μ )
E[etX ]
≤ t(1+δ)μ
e
e(e −1)μ
t

≤ t(1+δ)μ .
e
For any δ > 0, we can set t = ln(1 + δ) > 0 to get Eqn. (4.1):
μ
eδ
Pr(X ≥ (1 + δ)μ) ≤ .
(1 + δ)(1+δ)
To obtain Eqn. (4.2) we need to show that, for 0 < δ ≤ 1,
eδ
≤ e−δ /3 .
2

(1 + δ)(1+δ)
Taking the logarithm of both sides, we obtain the equivalent condition
δ2
f (δ) = δ − (1 + δ) ln(1 + δ) + ≤ 0.
3
Computing the derivatives of f (δ), we have:
1+δ 2
f (δ) = 1 − − ln(1 + δ) + δ
1+δ 3
2
= − ln(1 + δ) + δ;
3
1 2
f (δ) = − + .
1+δ 3
We see that f (δ) < 0 for 0 ≤ δ < 1/2 and that f (δ) > 0 for δ > 1/2. Hence f (δ)
first decreases and then increases over the interval [0, 1]. Since f (0) = 0 and f (1) <
0, we can conclude that f (δ) ≤ 0 in the interval [0, 1]. Since f (0) = 0, it follows that
f (δ) ≤ 0 in that interval, proving Eqn. (4.2).
To prove Eqn. (4.3), let R = (1 + δ)μ. Then, for R ≥ 6μ, δ = R/μ − 1 ≥ 5. Hence,
using Eqn. (4.1),
μ
eδ
Pr(X ≥ (1 + δ)μ) ≤
(1 + δ)(1+δ)
(1+δ)μ
e
≤
1+δ
e R
≤
6
−R
≤2 .

70
4.2 deriving and applying chernoff bounds

We obtain similar results bounding the deviation below the mean.

Theorem4.5: Let X1 , . . . , Xn be independent Poisson trials such that Pr(Xi = 1) = pi .
Let X = ni=1 Xi and μ = E[X]. Then, for 0 < δ < 1:
μ
1. e−δ
Pr(X ≤ (1 − δ)μ) ≤ ; (4.4)
(1 − δ)(1−δ)
Pr(X ≤ (1 − δ)μ) ≤ e−μδ /2
2
2. . (4.5)
Again, the bound of Eqn. (4.4) is stronger than Eqn. (4.5), but the latter is generally
easier to use and sufficient in most applications.
Proof: Using Markov’s inequality, for any t < 0 we have
Pr(X ≤ (1 − δ)μ) = Pr(etX ≥ et(1−δ)μ )
E[etX ]
≤ t(1−δ)μ
e
e(e −1)μ
t

≤ t(1−δ)μ .
e
For 0 < δ < 1, we set t = ln(1 − δ) < 0 to get Eqn. (4.4):
μ
e−δ
Pr(X ≤ (1 − δ)μ) ≤ .
(1 − δ)(1−δ)
To prove Eqn. (4.5) we must show that, for 0 < δ < 1,
e−δ
≤ e−δ /2 .
2

(1 − δ)(1−δ)
Taking the logarithm of both sides, we obtain the equivalent condition
δ2
f (δ) = −δ − (1 − δ) ln(1 − δ) + ≤0
2
for 0 < δ < 1.
Differentiating f (δ) yields
f (δ) = ln(1 − δ) + δ,
1
f (δ) = − + 1.
1−δ
Since f (δ) < 0 in the range (0, 1) and since f (0) = 0, we have f (δ) ≤ 0 in the range
[0, 1). Therefore, f (δ) is nonincreasing in that interval. Since f (0) = 0, it follows that
f (δ) ≤ 0 when 0 < δ < 1, as required.
Often the following form of the Chernoff bound, which is derived immediately from
Eqn. (4.2) and Eqn. (4.4), is used.

Let X1 , . . . , Xn be independent Poisson trials such that Pr(Xi = 1) =

Corollary 4.6:
pi . Let X = ni=1 Xi and μ = E[X]. For 0 < δ < 1,

Pr(|X − μ| ≥ δμ) ≤ 2e−μδ /3

2
. (4.6)
71
chernoff and hoeffding bounds

In practice we often do not have the exact value of E[X]. Instead we can use μ ≥ E[X]
in Theorem 4.4 and μ ≤ E[X] in Theorem 4.5 (see Exercise 4.7).

4.2.2. Example: Coin Flips

Let X be the number of heads in a sequence of n independent fair coin flips. Applying
the Chernoff bound of Eqn. (4.6), we have
"
n 1 √ 1 n 6 ln n
Pr X − ≥ 6n ln n ≤ 2 exp −
2 2 32 n
2
= .
n
This demonstrates that the concentration of the number of heads around the mean
n/2
√is very tight; most of the time, the deviations from the mean are on the order of
O n ln n .
To compare the power of this bound to Chebyshev’s bound, consider the probability
of having no more than n/4 heads or no fewer than 3n/4 heads in a sequence of n
independent fair coin flips. In the previous chapter, we used Chebyshev’s inequality to
show that

n n 4
Pr X − ≥ ≤ .
2 4 n
Already, this bound is worse than the Chernoff bound just calculated for a significantly
larger event! Using the Chernoff bound in this case, we find that
"

n n 1n1
Pr X − ≥ ≤ 2 exp −
2 4 324
≤ 2e−n/24 .
Thus, Chernoff’s technique gives a bound that is exponentially smaller than the bound
obtained using Chebyshev’s inequality.

4.2.3. Application: Estimating a Parameter

Suppose that we are interested in evaluating the probability that a particular gene muta-
tion occurs in the population. Given a DNA sample, a lab test can determine if it carries
the mutation. However, the test is expensive and we would like to obtain a relatively
reliable estimate from a small number of samples.
Let p be the unknown value that we are trying to estimate. Assume that we have
n samples and that X = p̃n of these samples have the mutation. Given a sufficiently
large number of samples, we expect the value p to be close to the sampled value p̃. We
express this intuition using the concept of a confidence interval.
Definition 4.2: A 1 − γ confidence interval for a parameter p is an interval [ p̃ −
δ, p̃ + δ] such that
Pr(p ∈ [ p̃ − δ, p̃ + δ]) ≥ 1 − γ .
72
4.3 better bounds for some special cases

Notice that, instead of predicting a single value for the parameter, we give an interval
that is likely to contain the parameter. If p can take on any real value, it may not make
sense to try to pin down its exact value from a finite sample, but it does make sense to
estimate it within some small range.
Naturally we want both the interval size 2δ and the error probability γ to be as
small as possible. We derive a trade-off between these two parameters and the number
of samples n. In particular, given that among n samples (chosen uniformly at random
from the entire population) we find the mutation in exactly X = p̃n samples, we need
to find values of δ and γ for which

Pr(p ∈ [ p̃ − δ, p̃ + δ]) = Pr(np ∈ [n( p̃ − δ), n( p̃ + δ)]) ≥ 1 − γ .

Now X = n p̃ has a binomial distribution with parameters n and p, so E[X] = np. If

p∈
/ [ p̃ − δ, p̃ + δ] then we have one of the following two events:

1. if p < p̃ − δ, then X = n p̃ > n(p + δ) = E[X](1 + δ/p);

2. if p > p̃ + δ, then n p̃ < n(p − δ) = E[X](1 − δ/p).

We can apply the Chernoff bounds in Eqns. (4.2) and (4.5) to compute

δ δ
Pr(p ∈
/ [ p̃ − δ, p̃ + δ]) = Pr X < np 1 − + Pr X > np 1 + (4.7)
p p
< e−np(δ/p) /2 + e−np(δ/p) /3
2 2
(4.8)
−nδ 2 /2p −nδ 2 /3p
=e +e . (4.9)

The bound given in Eqn. (4.9) is not useful because the value of p is unknown. A
simple solution is to use the fact that p ≤ 1, yielding

/ [ p̃ − δ, p̃ + δ]) < e−nδ /2

+ e−nδ /3
2 2
Pr(p ∈ .

Setting γ = e−nδ /2 + e−nδ /3 , we obtain a trade-off between δ, n, and the error proba-
2 2

bility γ .
We can apply other Chernoff bounds, such as those in Exercises 4.13 and 4.16, to
obtain better bounds. We return to the subject of parameter estimation when we discuss
the Monte Carlo method in Chapter 11.

4.3. Better Bounds for Some Special Cases

We can obtain stronger bounds using a simpler proof technique for some special cases
of symmetric random variables.
We consider first the sum of independent random variables when each variable
assumes the value 1 or −1 with equal probability.

Theorem 4.7: Let X1 , . . . , Xn be independent random variables with

1
Pr(Xi = 1) = Pr(Xi = −1) = .
2
73
chernoff and hoeffding bounds

n
Let X = i=1 Xi . For any a > 0,

Pr(X ≥ a) ≤ e−a /2n .

Proof: For any t > 0,

1 t 1 −t
E[etXi ] = e + e .
2 2
To estimate E[etXi ], we observe that
t2 ti
et = 1 + t + + ··· + + ···
2! i!
and
t2 ti
e−t = 1 − t + + · · · + (−1)i + · · · ,
2! i!
using the Taylor series expansion for et . Thus,
1 t 1 −t
E[etXi ] = e + e
2 2
t 2i
=
i≥0
(2i)!
(t 2 /2)i
≤
i≥0
i!
2
/2
= et .
Using this estimate yields

n
2
E[etX ] = E[etXi ] ≤ et n/2

i=1

and
E[etX ] 2
Pr(X ≥ a) = Pr(etX ≥ eta ) ≤ ta
≤ et n/2−ta .
e
Setting t = a/n, we obtain
Pr(X ≥ a) ≤ e−a /2n .
2

By symmetry we also have

Pr(X ≤ −a) ≤ e−a /2n .
2

Combining the two results yields our next corollary.

Corollary 4.8: Let X1 , . . . , Xn be independent random variables with
1
Pr(Xi = 1) = Pr(Xi = −1) = .
2
74
4.3 better bounds for some special cases

n
Let X = i=1 Xi . Then, for any a > 0,

Pr(|X| ≥ a) ≤ 2e−a /2n .

Applying the transformation Yi = (Xi + 1)/2 allows us to prove the following.

Corollary 4.9: Let Y1 , . . . , Yn be independent random variables with

1
Pr(Yi = 1) = Pr(Yi = 0) = .
2
n
Let Y = i=1 Yi and μ = E[Y ] = n/2.

1. For any a > 0,

Pr(Y ≥ μ + a) ≤ e−2a /n .
2

2. For any δ > 0,

Pr(Y ≥ (1 + δ)μ) ≤ e−δ μ .

2
(4.10)

Proof: Using the notation of Theorem 4.7, we have

n
n
1 n 1
Y = Yi = Xi + = X + μ.
i=1
2 i=1 2 2

Applying Theorem 4.7 yields

Pr(Y ≥ μ + a) = Pr(X ≥ 2a) ≤ e−4a /2n ,

proving the first part of the corollary. The second part follows from setting a = δμ =
δn/2. Again applying Theorem 4.7, we have

Pr(Y ≥ (1 + δ)μ) = Pr(X ≥ 2δμ) ≤ e−2δ μ2 /n

= e−δ μ .
2 2

Note that the constant in the exponent of the bound of Eqn. (4.10) is 1 instead of the
1/3 in the bound of Eqn. (4.2).
Similarly, we have the following result.

Corollary 4.10: Let Y1 , . . . , Yn be independent random variables with

1
Pr(Yi = 1) = Pr(Yi = 0) = .
2
n
Let Y = i=1 Yi and μ = E[Y ] = n/2.

1. For any 0 < a < μ,

Pr(Y ≤ μ − a) ≤ e−2a /n .
2

2. For any 0 < δ < 1,

Pr(Y ≤ (1 − δ)μ) ≤ e−δ μ .

2
(4.11)

75
chernoff and hoeffding bounds

4.4. Application: Set Balancing

Given an n × m matrix A with entries in {0, 1}, let

⎛ ⎞⎛ ⎞ ⎛ ⎞
a11 a12 · · · a1m b1 c1
⎜a ⎟ ⎜ ⎟ ⎜ ⎟
⎜ 21 a22 · · · a2m ⎟ ⎜ b2 ⎟ ⎜c2 ⎟
⎜ . .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎟ ⎜ ⎟ = ⎜
⎜ . .. .. ⎟.
⎝ . . . . ⎠⎝ . ⎠ ⎝ . ⎠
an1 an2 · · · anm bm cn

Suppose that we are looking for a vector b̄ with entries in {−1, 1} that minimizes

Ab̄∞ = max |ci |.

i=1,...,n

This problem arises in designing statistical experiments. Each column of the matrix A
represents a subject in the experiment and each row represents a feature. The vector
b̄ partitions the subjects into two disjoint groups, so that each feature is roughly as
balanced as possible between the two groups. One of the groups serves as a control
group for an experiment that is run on the other group.
Our randomized algorithm for computing a vector b̄ is extremely simple. We ran-
domly choose the entries of b̄, with Pr(bi = 1) = Pr(bi = −1) = 1/2. The choices
for different entries are independent. Surprisingly, although this algorithm ignores the
entries
√ of the matrix A, the following theorem shows that Ab̄∞ is likely to be only
O m ln n . This bound is fairly tight. In Exercise 4.15 you √are asked to show that,
when m = n, there exists a matrix A for which Ab̄∞ is n for any choice of b̄.

Theorem 4.11: For a random vector b̄ with entries chosen independently and with
equal probability from the set {−1, 1},
√ 2
Pr Ab̄∞ ≥ 4m ln n ≤ .
n
Proof: Consider
√ the ith row āi = ai,1 , . . . , ai,m , and
√ let k be the number of 1s in that
row.
√ If k ≤ 4m ln n, then clearly |āi · b̄| = |ci | ≤ 4m ln n. On the other hand, if k >
4m ln n then we note that the k nonzero terms in the sum

m
Zi = ai, j b j
j=1

are independent random variables, each with probability 1/2 of being either +1 or −1.
Now using the Chernoff bound of Corollary 4.8 and the fact that m ≥ k,
√ 2
Pr |Zi | > 4m ln n ≤ 2e−4m ln n/2k ≤ 2 .
n
By the union bound, the probability that the bound fails for any row is at most
2/n.

76
4.5 the hoeffding bound

4.5. The Hoeffding Bound

Hoeffding’s bound extends the Chernoff bound technique to general random variables
with a bounded range.
Theorem 4.12 [Hoeffding Bound]: Let X1 , . . . , Xn be independent random variables
such that for all 1 ≤ i ≤ n, E[Xi ] = μ and Pr(a ≤ Xi ≤ b) = 1. Then
n
1

Xi − μ ≥ ≤ 2e−2n /(b−a) .
2 2
Pr
n
i=1

Proof: The proof relies on the following bound for the moment generating function,
which we prove first.
Lemma 4.13 [Hoeffding’s Lemma]: Let X be a random variable such that
Pr(X ∈ [a, b]) = 1 and E[X] = 0. Then for every λ > 0,
E[eλX ] ≤ eλ (b−a)2 /8
2
.
Proof: Before beginning, note that since E[X] = 0, if a = 0 then b = 0 and the state-
ment is trivial. Hence we may assume a < 0 and b > 0.
Since f (x) = eλx is a convex function, for any α ∈ (0, 1),
f (αa + (1 − α)b) ≤ αeλa + (1 − α)eλb .
For x ∈ [a, b], let α = b−x
b−a
; then x = αa + (1 − α)b and we have
b − x λa x − a λb
eλx ≤ e + e .
b−a b−a
We consider eλX and take expectations. Using the fact that E[X] = 0, we have

b − X λa X − a λb
E[eλX ] ≤ E e +E e
b−a b−a
b λa E[X] λa a λb E[X] λb
= e − e − e + e
b−a b−a b−a b−a
b λa a λb
= e − e .
b−a b−a
We now require some manipulation of this final expression. Let φ(t ) = −θt +
−a
ln(1 − θ + θet ), for θ = b−a > 0. Then

eφ(λ(b−a)) = e−θλ(b−a) (1 − θ + θeλ(b−a) )

= eλa (1 − θ + θeλ(b−a) )

λa b a λ(b−a)
=e − e
b−a b−a
b λa a λb
= e − e ,
b−a b−a
which equals the upper bound we derived for E[eλX ]. It is not hard to verify that φ(0) =
φ (0) = 0, and φ (t) ≤ 1/4 for all t. By Taylor’s theorem, for any t > 0 there is a
77
chernoff and hoeffding bounds

t ∈ [0, t] such that

1 1
φ(t) = φ(0) + tφ (0) + t 2 φ (t ) ≤ t 2 .
2 8
Thus, for t = λ(b − a), we have

λ2 (b − a)2
φ(λ(b − a)) ≤ .
8
It follows that

E[eλX ] ≤ eφ(λ(b−a)) ≤ eλ (b−a)2 /8

2
.

We now return to the proof of Theorem 4.12. Let Zi = Xi − E[Xi ] and Z =

1 n
n i=1 Zi .
For any λ > 0, by Markov’s inequality,

n
λZ λ −λ λZ −λ
Pr(Z ≥ ) = Pr(e ≥e )≤e E[e ] ≤ e E[eλZi /n ]
i=1

n
≤ e−λ eλ (b−a)2 /n2
≤ e−λ+λ (b−a)2 /8n
2 2
,
i=1

where for the key second to last inequality we have used Hoeffding’s Lemma with the
fact that Zi /n is bounded between (a − μ)/n and (b − μ)/n. Setting λ = (b−a)
4n
2 gives

1
n
= Pr(Z ≥ ) ≤ e−2n /(b−a)2
2
Pr Xi − μ ≥ .
n i=1

Applying the same argument for Pr(Z ≤ −) with λ = − (b−a)

4n
2 gives

1
n
= Pr(Z ≤ −) ≤ e−2n /(b−a)2
2
Pr Xi − μ ≤ − .
n i=1

Applying a union bound on the two cases gives the theorem.

The proof of the following more general version of the bound is left as an exercise
(Exercise 4.20).

Theorem 4.14: Let X1 , . . . , Xn be independent random variables with E[Xi ] = μi and

Pr(ai ≤ Xi ≤ bi ) = 1 for constants ai and bi . Then
n
n n

μi ≥ ≤ 2e−2 / i=1 (bi −ai ) .
2 2
Pr Xi −

i=1 i=1

Note that Theorem 4.12 bounds the deviation of the average of the n random vari-
ables while Theorem 4.14 bounds the deviation of the sum of the variables.
78
∗
4.6 application: packet routing in sparse networks

Examples:
1. Consider n independent random variables X1 , . . . , Xn such that Xi is uniformly dis-
tributed in {0, . . . , }. For all i, μ = E[Xi ] = /2, and
n
1

Xi − ≥ ≤ 2e−2n / .
2 2
Pr
n 2
i=1

In particular,
n
1

Xi − μ ≥ δμ ≤ 2e−nδ /2 .
2
Pr
n
i=1

2. Consider n independent random

variables Y1 , . . . , Yn such that Yi is uniformly
dis-
tributed in {0, i}. Let Y = ni=1 Yi . Then E[Yi ] = i/2, and μ = E[Y ] = ni=1 i/2 =
n(n + 1)/4. Applying Theorem 4.14 with ci = i we have

n(n + 1) n 2
Pr Y − ≤ 2e−2 / i=1 ci = 2e−2 /(n(n+1)(2n+1)/6)
2 2

≥
4
= 2e−12 /(n(n+1)(2n+1))
2
.
We can conclude
Pr(|Y − μ| ≥ δμ) ≤ 2e−12δ n (n+1)2 /(16n(n+1)(2n+1))
≤ 2e−3nδ /8
2 2 2
.

4.6.∗ Application: Packet Routing in Sparse Networks

A fundamental problem in parallel computing is how to communicate efficiently over

sparse communication networks. We model a communication network by a directed
graph on N nodes. Each node is a routing switch. A directed edge models a commu-
nication channel, which connects two adjacent routing switches. We consider a syn-
chronous computing model in which (a) an edge can carry one packet in each time step
and (b) a packet can traverse no more than one edge per step. We assume that switches
have buffers or queues to store packets waiting for transmission through each of the
switch’s outgoing edges.
Given a network topology, a routing algorithm specifies, for each pair of nodes, a
route – or a sequence of edges – connecting the pair in the network. The algorithm
may also specify a queuing policy for ordering packets in the switches’ queues. For
example, the First In First Out (FIFO) policy orders packets by their order of arrival.
The Furthest To Go (FTG) policy orders packets in decreasing order of the number of
edges they must still cross in the network.
Our measure of the performance of a routing algorithm on a given network topology
is the maximum time – measured as the number of parallel steps – required to route an
arbitrary permutation routing problem, where each node sends exactly one packet and
each node is the address of exactly one packet.
Of course, routing a permutation can be done in just one parallel step if the net-
work is a complete graph connecting all of the nodes to each other. Practical consider-
ations, however, dictate that a network for a large-scale parallel machine must be sparse.
79
chernoff and hoeffding bounds

Each node can be connected directly to only a few neighbors, and most packets must
traverse intermediate nodes en route to their final destination. Since an edge may be
on the path of more than one packet and since each edge can process only one packet
per step, parallel packet routing on sparse networks may lead to congestion and bottle-
necks. The practical problem of designing an efficient communication scheme for par-
allel computers leads to an interesting combinatorial and algorithmic problem: design-
ing a family of sparse networks connecting any number of processors, together with
a routing algorithm that routes an arbitrary permutation request in a small number of
parallel steps.
We discuss here a simple and elegant randomized routing technique and then use
Chernoff bounds to analyze its performance on the hypercube network and the butterfly
network. We first analyze the case of routing a permutation on a hypercube, a network
with N processors and O(N log N) edges. We then present a tighter argument for the
butterfly network, which has N nodes and only O(N) edges.

4.6.1. Permutation Routing on the Hypercube

Let N = {0 ≤ i ≤ N − 1} be the set of processors in our parallel machine and assume
that N = 2n for some integer n. Let x̄ = (x1 , . . . , xn ) be the binary representation of the
number 0 ≤ x ≤ N − 1.

Definition 4.3: The n-dimensional hypercube (or n-cube) is a network with N = 2n

nodes such that node x has a direct connection to node y if and only if x̄ and ȳ differ in
exactly one bit.

See Figure 4.1. Note that the total number of directed edges in the n-cube is nN, since
each node is adjacent to n outgoing and n ingoing edges. Also, the diameter of the
network is n; that is, there is a directed path of length up to n connecting any two
nodes in the network, and there are pairs of nodes that are not connected by any shorter
path.
The topology of the hypercube allows for a simple bit-fixing routing mechanism, as
shown in Algorithm 4.1. When determining which edge to cross next, the algorithm
simply considers each bit in order and crosses the edge if necessary.
Although it seems quite natural, using only the bit-fixing routes can lead to high
levels of congestion and poor performance, as shown in Exercise 4.22. There are certain
permutations on which the bit-fixing routes behave poorly. It turns out, as we will show,
that these routes perform well if each packet is being sent from a source to a destination
chosen uniformly at random. This motivates the following approach: first route each
packet to a randomly chosen intermediate point, and then route it from this intermediate
point to its final destination.
It may seem unusual to first route packets to a random intermediate point. In some
sense, this is similar in spirit to our analysis of Quicksort in Section 2.5. We found there
that for a list already sorted in reverse order, Quicksort would take (n2 ) comparisons,
whereas the expected number of comparisons for a randomly chosen permutation is
only O(n log n). Randomizing the data can lead to a better running time for Quicksort.
80
∗
4.6 application: packet routing in sparse networks

000 010

0 00 10 100 110

001 011

1 01 11 101 111

(a) n = 1. (b) n = 2. (c) n = 3.

0000 0010 1000 1010

0100 0110 1100 1110

0001 0011 1001 1011

0101 0111 1101 1111

(d) n = 4.

Figure 4.1: Hypercubes of dimensions 1, 2, 3, and 4.

n-Cube Bit-Fixing Routing Algorithm:

1. Let ā and b̄ be the origin and the destination of the packet.
2. For i = 1 to n, do:
(a) If ai = bi then traverse the edge (b1 , . . . , bi−1 , ai , . . . , an ) →
(b1 , . . . , bi−1 , bi , ai+1 , . . . , an ).

Algorithm 4.1: n-Cube bit-fixing routing algorithm.

Here, too, randomizing the routes that packets take – by routing them through a ran-
dom intermediate point – avoids bad initial permutations and leads to good expected
performance.
The two-phase routing algorithm (Algorithm 4.2) is executed in parallel by all the
packets. The random choices are made independently for each packet. Our analysis
holds for any queueing policy that obeys the following natural requirement: if a queue
is not empty at the beginning of a time step, some packet is sent along the edge associ-
ated with that queue during that time step. We prove that this routing strategy achieves
asymptotically optimal parallel time.
81
chernoff and hoeffding bounds

Two-Phase Routing Algorithm:

Phase I – Route the packet to a randomly chosen node in the network using the
bit-fixing route.
Phase II – Route the packet from its random location to its final destination using
the bit-fixing route.

Algorithm 4.2: Two-phase routing algorithm.

Theorem 4.15: Given an arbitrary permutation routing problem, with probability

1 − O(N −1 ) the two-phase routing scheme of Algorithm 4.2 routes all packets to their
destinations on the n-cube in O(n) = O(log N) parallel steps.

Proof: We first analyze the run-time of Phase I. To simplify the analysis we assume that
no packet starts the execution of Phase II before all packets have finished the execution
of Phase I. We show later that this assumption can be removed.
We emphasize a fact that we use implicitly throughout. If a packet is routed to a
randomly chosen node x̄ in the network, we can think of x̄ = (x1 , . . . , xn ) as being
generated by setting each xi independently to be 0 with probability 1/2 and 1 with
probability 1/2.
For a given packet M, let T1 (M) be the number of steps for M to finish Phase I. For
a given edge e, let X1 (e) denote the total number of packets that traverse edge e during
Phase I.
In each step of executing Phase I, packet M is either traversing an edge or waiting in a
queue while some other packet traverses an edge on M’s route. This simple observation
relates the routing time of M to the total number of packet transitions through edges on
the path of M, as follows.

Lemma 4.16: Let e1 , . . . , em be the m ≤ n edges traversed by a packet M in Phase I.

Then

m
T1 (M) ≤ X1 (ei ).
i=1

Let us call any path P = (e1 , e2 , . . . , em ) of m ≤ n edges that follows the bit-fixing
algorithm a possible packet path. We denote the corresponding nodes by v0 , v1 , . . . , vm
with ei = (vi−1 , vi ). Following the definition of T1 (M), for any possible packet path P
we let

m
T1 (P) = X1 (ei ).
i=1

By Lemma 4.16, the probability that Phase I takes more than T steps is bounded by
the probability that, for some possible packet path P, T1 (P) ≥ T . Note that there are
at most 2n · 2n = 22n possible packet paths, since there are 2n possible origins and 2n
possible destinations.
82
∗
4.6 application: packet routing in sparse networks

To prove the theorem, we need a high-probability bound on T1 (P). Since T1 (P)

equals the summation ni=1 X1 (ei ), it would be natural to try to use a Chernoff bound.
The difficulty here is that the X1 (ei ) are not independent random variables, since a
packet that traverses an edge is likely to traverse one of its adjacent edges. To circum-
vent this difficulty, we first use a Chernoff bound to prove that, with high probability, no
more than 6n different packets cross any edge of P. We then condition on this event to
derive a high-probability bound on the total number of transitions these packets make
through edges of the path P, again using a Chernoff bound.1
Let us now fix a specific possible packet path P with m edges. To obtain a high-
probability bound on the number of packets that cross an edge of P, let us call a packet
active at a node vi−1 on the path P if it reaches vi−1 and has the possibility of crossing
edge ei to vi . That is, if vi−1 and vi differ in the jth bit then – in order for a packet to
be active at vi−1 – its jth bit cannot have been fixed by the bit-fixing algorithm when it
reaches vi−1 . We may also call a packet active if it is active at some vertex on the path
P. We bound the total number of active packets.
For k = 1, . . . , N, let Hk be a 0–1 random variable such that Hk = 1 if the packet
starting at node k is active and Hk = 0 otherwise. Notice that the Hk are independent
because (a) each Hk depends only on the choice of the intermediate destination of the
packetstarting at node k and (b) these choices are independent for all packets. Let
H = Nk=1 Hk be the total number of active packets.
We first bound E[H]. Consider all the active packets at vi−1 . Assume that vi−1 =
(b1 , . . . , b j−1 , a j , a j+1 , . . . , an ) and vi = (b1 , . . . , b j−1 , b j , a j+1 , . . . , an ). Then only
packets that start at one of the addresses (∗, . . . , ∗, a j , . . . , an ), where ∗ stands for
either a 0 or a 1, can reach vi−1 before the jth bit is fixed. Similarly, each of these
packets actually reaches vi−1 only if its random destination is one of the addresses
(b1 , . . . , b j−1 , ∗, . . . , ∗). Thus, there are no more than 2 j−1 possible active packets at
vi−1 , and the probability that each of these packets is actually active at vi−1 is 2−( j−1) .
Hence the expected number of active packets per vertex is 1 and, since we need only
consider the m vertices v0 , . . . , vm−1 , it follows by linearity of expectations that
E[H] ≤ m · 1 ≤ n.
Since H is the sum of independent 0–1 random variables, we can apply the Chernoff
bound (we use the bound of Eqn. (4.3)) to prove
Pr(H ≥ 6n ≥ 6E[H]) ≤ 2−6n .
The high-probability bound for H can help us obtain a bound for T1 (P) as follows.
Using
Pr(A) = Pr(A | B) Pr(B) + Pr(A | B̄) Pr(B̄)
≤ Pr(B) + Pr(A | B̄),

1 This approach overestimates the time to finish a phase. In fact, there is a deterministic argument showing that,
in this setting, the delay of a packet on a path is bounded by the number of different packets that traverse edges
of the path, and hence there is no need to bound the total number of traversals of these packets on the path.
However, in the spirit of this book we prefer to present the probabilistic argument.

83
chernoff and hoeffding bounds

we find for a given possible packet path P that

Pr(T1 (P) ≥ 30n) ≤ Pr(H ≥ 6n) + Pr(T1 (P) ≥ 30n | H < 6n)
≤ 2−6n + Pr(T1 (P) ≥ 30n | H < 6n).

Hence if we show

Pr(T1 (P) ≥ 30n | H < 6n) ≤ 2−3n−1 ,

we then have

Pr(T1 (P) ≥ 30n) ≤ 2−3n ,

which proves sufficient for our purposes.

We therefore need to bound the conditional probability Pr(T1 (P) ≥ 30n | H ≤ 6n).
In other words, conditioning on having no more than 6n active packets that might use
edges of P, we need a bound on the total number of transitions that these packets take
through edges of P.
We first observe that, if a packet leaves the path, it cannot return to that path in this
phase of the routing algorithm. Indeed, assume that the active packet was at vi and that
it moved to w = vi+1 . The smallest index bit in which vi+1 and w differ cannot be fixed
later in this phase, so the route of the packet and the path P cannot meet again in this
phase.
Now suppose we have an active packet on our path P at node vi . What is the prob-
ability that the packet crosses ei ? Let us think of our packet as fixing the bits in the
binary representation of its destination one at a time by independent random coin flips.
The nodes of the edge ei differ in one bit (say, the jth bit) in this representation. It is
therefore clear that the probability of the packet crossing edge ei is at most 1/2, since
to cross this edge it must choose the appropriate value for the jth bit. (In fact, the prob-
ability might be less than 1/2; the packet might cross some other edge before choosing
the value of the jth bit.)
To obtain our bound, let us view as a trial each point in the algorithm where an
active packet at a node vi on the path P might cross edge ei . The trial is successful if
the packet leaves the path but a failure if the packet stays on the path. Since the packet
leaves the path on a successful trial, if there are at most 6n active packets then there
can be at most 6n successes. Each trial is successful, independently, with probability at
least 1/2. The number of trials is itself a random variable, which we use in our bound
of T1 (P).
We claim that the probability that the active packets cross edges of P more than 30n
times is less than the probability that a fair coin flipped 36n times comes up heads fewer
than 6n times. To see this, think of a coin being flipped for each trial, with heads corre-
sponding to a success. The coin is biased to come up heads with the proper probability
for each trial, but this probability is always at least 1/2 and the coins are independent
for each trial. Each failure (tails) corresponds to an active packet crossing an edge, but
once there have been 6n successes we know there are no more active packets left that
can cross an edge of the path. Using a fair coin instead of a coin possibly biased in
favor of success can only lessen the probability that the active packets cross edges of
84
∗
4.6 application: packet routing in sparse networks

P more than 30n times, as can be shown easily by induction (on the number of biased
coins).
Letting Z be the number of heads in 36n fair coin flips, we now apply the Chernoff
bound of Eqn. (4.5) to prove:
Pr(T1 (P) ≥ 30n | H ≤ 6n) ≤ Pr(Z ≤ 6n) ≤ e−18n(2/3) /2 = e−4n ≤ 2−3n−1 .
2

It follows that
Pr(T1 (P) ≥ 30n) ≤ Pr(H ≥ 6n) + Pr(T1 (P) ≥ 30n | H ≤ 6n) ≤ 2−3n ,
as we wanted to show. Because there are at most 22n possible packet paths in the hyper-
cube, the probability that there is any possible packet path for which T1 (P) ≥ 30n is
bounded by
22n 2−3n = 2−n = O(N −1 ).
This completes the analysis of Phase I. Consider now the execution of Phase II,
assuming that all packets completed their Phase I route. In this case, Phase II can be
viewed as running Phase I backwards: instead of packets starting at a given origin
and going to a random destination, they start at a random origin and end at a given
destination. Hence no packet spends more than 30n steps in Phase II with probability
1 − O(N −1 ).
In fact, we can remove the assumption that packets begin Phase II only after Phase
I has completed. The foregoing argument allows us to conclude that the total number
of packet traversals across the edges of any packet path during Phase I and Phase II
together is bounded by 60n with probability 1 − O(N −1 ). Since a packet can be delayed
only by another packet traversing that edge, we find that every packet completes both
Phase I and Phase II after 60n steps with probability 1 − O(N −1 ) regardless of how the
phases interact, concluding the proof of Theorem 4.15
Note that the run-time of the routing algorithm is optimal up to a constant factor, since
the diameter of the hypercube is n. However, the network is not fully utilized because
2nN directed edges are used to route just N packets. At any give time, at most 1/2n of
the edges are actually being used. This issue is addressed in the next section.

4.6.2. Permutation Routing on the Butterfly

In this section we adapt the result for permutation routing on the hypercube networks
to routing on butterfly networks, yielding a significant improvement in network utiliza-
tion. Specifically, our goal in this section is to route a permutation on a network with
N nodes and O(N) edges in O(log N) parallel time steps. Recall that the hypercube
network had N nodes but (N log N) edges. Although the argument will be similar
in spirit to that for the hypercube network, there is some additional complexity to the
argument for the butterfly network.
We work on the wrapped butterfly network, defined as follows.
Definition 4.4: The wrapped butterfly network has N = n2n nodes. The nodes are
arranged in n columns and 2n rows. A node’s address is a pair (x, r), where 1 ≤ x ≤ 2n
85
chernoff and hoeffding bounds

l3
ve

ve
le

le
row 000

row 001

row 010

row 011

row 100

row 101

row 110

row 111

Figure 4.2: The butterfly network. In the wrapped butterfly, levels 0 and 3 are collapsed into one
level.

is the row number and 0 ≤ r ≤ n − 1 is the column number of the node. Node (x, r) is
connected to node (y, s) if and only if s = r + 1 mod n and either:

1. x = y (the “direct” edge); or

2. x and y differ in precisely the sth bit in their binary representation (the “flip” edge).

See Figure 4.2. To see the relation between the wrapped butterfly and the hypercube,
observe that by collapsing the n nodes in each row of the wrapped butterfly into one
“super node” we obtain an n-cube network. Using this correspondence, one can easily
verify that there is a unique directed path of length n connecting node (x, r) to any
other node (w, r) in the same column. This path is obtained by bit fixing: first fixing
bits r + 1 to n, then bits 1 to r. See Algorithm 4.3. Our randomized permutation routing
algorithm on the butterfly consists of three phases, as shown in Algorithm 4.4.
Unlike our analysis of the hypercube, our analysis here cannot simply bound the
number of active packets that possibly traverse edges of a path. Given the path of a
packet, the expected number of other packets that share edges with this path when
86
∗
4.6 application: packet routing in sparse networks

Wrapped Butterfly Bit-Fixing Routing Algorithm:

1. Let (x, r) and (y, r) be the origin and the destination of a packet.
2. For i = 0 to n − 1, do:
(a) j = ((i + r) mod n) + 1;
(b) if a j = b j then traverse the direct edge to column j mod n, else traverse the
flip edge to column j mod n.

Algorithm 4.3: Wrapped butterfly bit-fixing routing algorithm.

Three-Phase Routing Algorithm:

For a packet sent from node (x, r) to node (y, s):

Phase I – Choose a random w ∈ [1, . . . , 2n ]. Route the packet from node (x, r) to
node (w, r) using the bit-fixing route.
Phase II – Route the packet to node (w, s) using direct edges.
Phase III – Route the packet from node (w, s) to node (y, s) using the bit-fixing
route.

Algorithm 4.4: Three-phase routing algorithm.

routing a random permutation on the butterfly network is (n2 ) and not O(n) as in the
n-cube. To obtain an O(n) routing time, we need a more refined analysis technique that
takes into account the order in which packets traverse edges.
Because of this, we need to consider the priority policy that the queues use when
there are several packets waiting to use the edge. A variety of priority policies would
work here; we assume the following rules.

1. The priority of a packet traversing an edge is (i − 1)n + t, where i is the current

phase of the packet and t is the number of edge traversals the packet has already
executed in this phase.
2. If at any step more than one packet is available to traverse an edge, the packet with
the smallest priority number is sent first.

Theorem 4.17: Given an arbitrary permutation routing problem on the wrapped but-
terfly with N = n2n nodes, with probability 1 − O(N −1 ) the three-phase routing scheme
of Algorithm 4.4 routes all packets to their destinations in O(n) = O(log N) parallel
steps.

Proof: The priority rule in the edge queues guarantees that packets in a phase cannot
delay packets in earlier phases. Because of this, in our forthcoming analysis we can
consider the time for each phase to complete separately and then add these times to
bound the total time for the three-phase routing scheme to complete.
We begin by considering the second phase. We first argue that with high probability
each row transmits at most 4n packets in the second phase. To see this, let Xw be the
87
chernoff and hoeffding bounds

number of packets whose intermediate row choice is w in the three-phase routing algo-
rithm. Then Xw is the sum of 0–1 independent random variables, one for each packet,
and E[Xw ] = n. Hence, we can directly apply the Chernoff bound of Eqn. (4.1) to find
3 n
e
Pr(Xw ≥ 4n) ≤ ≤ 3−2n .
44
There are 2n possible rows w. By the union bound, the probability that any row has
more than 4n packets is only 2n · 3−2n = O(N −1 ).
We now argue that, if each row has at most 4n packets for the second phase, then the
second phase takes at most 5n steps to complete. Combined with our previous observa-
tions, this means the second phase takes at most 5n steps with probability 1 − O(N −1 ).
To see this, note that in the second phase the routing has a special structure: each packet
moves from edge to edge along its row. Because of the priority rule, each packet can
be delayed only by packets already in a queue when it arrives. Therefore, to place an
upper bound on the number of packets that delay a packet p, we can bound the total
number of packets found in each queue when p arrives at the queue. But in Phase II, the
number of other packets that an arriving packet finds in a queue cannot increase in size
over time, since at each step a queue sends a packet and receives at most one packet.
(It is worth considering the special case when a queue becomes empty at some point
in Phase II; this queue can receive another packet at some later step, but the number of
packets an arriving packet will find in the queue after that point is always zero.) Since
there are at most 4n packets total in the row to begin with, p finds at most 4n packets
that delay it as it moves from queue to queue. Since each packet moves at most n times
in the second phase, the total time for the phase is 5n steps.
We now consider the other phases. The first and third phases are again the same by
symmetry, so we consider just the first phase. Our analysis will use a delay sequence
argument.

Definition 4.5: A delay sequence for an execution of Phase I is a sequence of n edges

e1 , . . . , en such that either ei = ei+1 or ei+1 is an outgoing edge from the end vertex of
ei . The sequence e1 , . . . , en has the further property that ei is (one of) the last edges to
transmit packets with priority up to i among ei+1 and the two incoming edges of ei+1 .
The relation between the delay sequence and the time for Phase I to complete is given
by the following lemma.
Lemma 4.18: For a given execution of Phase I and delay sequence e1 , . . . , en , let ti be
the number of packets with priority i sent through edge ei . Let Ti be the time that edge ei
finishes sending all packets with priority number up to i, so that Tn is the earliest time
at which all packets passing through en during Phase I have passed through it. Then:

1. Tn ≤ ni=1 ti .
2. If the execution of Phase
I takes T steps, then there is a delay sequence for this
execution for which ni=1 ti ≥ T .
Proof: By the design of the delay sequence, at time Ti the queue of ei+1 already holds
all of the packets that it will need to subsequently transmit with priority i + 1, and at
88
∗
4.6 application: packet routing in sparse networks

that time it has already finished transmitting all packets with priority numbers up to i.
Thus,
Ti+1 ≤ Ti + ti+1 .
Since T1 = t1 , we have
Tn ≤ Tn−1 + tn
≤ Tn−2 + tn−1 + tn
n
≤ ti ,
i=1

proving the first part of the lemma.

For the second part, assume that Phase I took T steps and let e be an edge that trans-
mitted a packet at time T. We can construct a delay sequence with en = e by choosing
en−1 to be the edge among e and its two incoming edges that last transmits packets of
priority
n n − 1, and similarly choosing en−2 down to e1 . By the first part of the lemma,
i=1 ti ≥ T.
Returning to the proof of Theorem 4.17, we now show that the probability of a delay
sequence with T ≥ 40n is only O(N −1 ). We call any sequence of edges e1 , . . . , en such
that either ei = ei+1 or ei+1 is an outgoing edge from the end vertex of ei a possible delay
sequence. For a given execution and a possible delay sequence, let ti be the number of
packets with priority i sent through ei . Let T = ni=1 ti . We first bound E[T ]. Consider
the edge ei = v → v . Packets with priority i pass through this edge only if their source
is at distance i − 1 from v. There are precisely 2i−1 nodes that are connected to v by a
directed path of length i − 1. Since packets are sent in Phase I to random destinations,
the probability that each of these nodes sends a packet that traverses edge ei is 2−i ,
giving
1 n
E[ti ] = 2i−1 2−i = and E[T ] = .
2 2
The motivation for using the delay sequence argument should now be clear. Each
possible delay sequence defines a random variable T, where E[T ] = n/2. The max-
imum of T over all delay sequences bounds the run-time of the phase. So we need
a bound on T that holds with sufficiently high probability to cover all possible delay
sequences. A high-probability bound on T can now be obtained using an argument sim-
ilar to the one used in the proof of Theorem 4.15. We first bound the number of different
packets that contribute to edge traversals counted in T.
For j = 1, . . . , N, let H j = 1 if any traversal of the packet sent by node j is counted

in T ; otherwise, H j = 0. Clearly, H = Nj=1 H j ≤ T and E[H] ≤ E[T ] = n/2, where
the H j are independent random variables. Applying the Chernoff bound of Eqn. (4.3)
therefore yields
Pr(H ≥ 5n) ≤ 2−5n .
Conditioning on the event H ≤ 5n, we now proceed to prove a bound on T, following
the same line as in the proof of Theorem 4.15. Given a packet u with at least one
89
chernoff and hoeffding bounds

traversal counted in T, we consider how many additional traversals of u are counted in

T. Specifically, if u is counted in ti then we consider the probability that it is counted
in ti+1 . We distinguish between two cases as follows.
1. If ei+1 = ei then u cannot be counted in ti+1 , since its traversal with priority i + 1 is
in the next column. Similarly, it cannot be counted in any t j , j > i.
2. If ei+1 = ei , then the probability that u continues through ei+1 (and is counted in
ti+1 ) is at most 1/2. If it does not continue through ei+1 , then it cannot intersect with
the delay sequence in any further traversals in this phase.
As in the proof of Theorem 4.15, the probability that T ≥ 40n is less than the prob-
ability that a fair coin flipped 40n times comes up heads fewer than 5n times. (Keep in
mind that, in this case, the first traversal by each packet in H must be counted as con-
tributing to T.) Letting Z be the number of heads in 40n fair coin flips, we now apply
the Chernoff bound (4.5) to prove
Pr(T ≥ 40n | H ≤ 5n) ≤ Pr(Z ≤ 5n) ≤ e−20n(3/4) /2 ≤ 2−5n .
2

We conclude that
Pr(T ≥ 40n) ≤ Pr(T ≥ 40n | H ≤ 5n) + Pr(H ≥ 5n) ≤ 2−5n+1 .
There are no more than 2N3n−1 ≤ n2n 3n possible delay sequences, since a sequence
can start in any one of the 2N edges of the network, and by Definition 4.5, if ei is the
ith edge in the sequence, there are only three possible assignments for ei+1 . Thus, the
probability that, in the execution of Phase I, there is a delay sequence with T ≥ 40n is
bounded above (using the union bound) by
n2n 3n 2−5n+1 ≤ O(N −1 ).
Since Phase III is entirely similar to Phase I and since Phase II also finishes in O(n)
steps with probability 1 − O(N −1 ), we have that the three-phase routing algorithm fin-
ishes in O(n) steps with probability 1 − O(N −1 ).

4.7. Exercises

Exercise 4.1: Alice and Bob play checkers often. Alice is a better player, so the proba-
bility that she wins any given game is 0.6, independent of all other games. They decide
to play a tournament of n games. Bound the probability that Alice loses the tournament
using a Chernoff bound.

Exercise 4.2: We have a standard six-sided die. Let X be the number of times that a 6
occurs over n throws of the die. Let p be the probability of the event X ≥ n/4. Compare
the best upper bounds on p that you can obtain using Markov’s inequality, Chebyshev’s
inequality, and Chernoff bounds.

Exercise 4.3: (a) Determine the moment generating function for the binomial random
variable B(n, p).
90
4.7 exercises

(b) Let X be a B(n, p) random variable and Y a B(m, p) random variable, where X
and Y are independent. Use part (a) to determine the moment generating function of
X + Y.
(c) What can we conclude from the form of the moment generating function of
X + Y?

Exercise 4.4: Determine the probability of obtaining 55 or more heads when flipping
a fair coin 100 times by an explicit calculation, and compare this with the Chernoff
bound. Do the same for 550 or more heads in 1000 flips.

Exercise 4.5: We plan to conduct an opinion poll to find out the percentage of people
in a community who want its president impeached. Assume that every person answers
either yes or no. If the actual fraction of people who want the president impeached is
p, we want to find an estimate X of p such that
Pr(|X − p| ≤ ε p) > 1 − δ
for a given ε and δ, with 0 < ε, δ < 1.
We query N people chosen independently and uniformly at random from the com-
munity and output the fraction of them who want the president impeached. How large
should N be for our result to be a suitable estimator of p? Use Chernoff bounds, and
express N in terms of p, ε, and δ. Calculate the value of N from your bound if ε = 0.1
and δ = 0.05 and if you know that p is between 0.2 and 0.8.

Exercise 4.6: (a) In an election with two candidates using paper ballots, each vote is
independently misrecorded with probability p = 0.02. Use a Chernoff bound to give
an upper bound on the probability that more than 4% of the votes are misrecorded in
an election of 1,000,000 ballots.
(b) Assume that a misrecorded ballot always counts as a vote for the other candidate.
Suppose that candidate A received 510,000 votes and that candidate B received 490,000
votes. Use Chernoff bounds to upper bound the probability that candidate B wins the
election owing to misrecorded ballots. Specifically, let X be the number of votes for
candidate A that are misrecorded and let Y be the number of votes for candidate B that
are misrecorded. Bound Pr((X > k) ∪ (Y < )) for suitable choices of k and .

Exercise 4.7: Throughout the chapter we implicitly assumed the following extension
of the Chernoff
n bound. Prove that it is true.
Let X = i=1 Xi , where the Xi are independent 0–1 random variables. Let μ =
E[X]. Choose any μL and μH such that μL ≤ μ ≤ μH . Then, for any δ > 0,
μH
eδ
Pr(X ≥ (1 + δ)μH ) ≤ .
(1 + δ)(1+δ)
Similarly, for any 0 < δ < 1,
μL
e−δ
Pr(X ≤ (1 − δ)μL ) ≤ .
(1 − δ)(1−δ)
91
chernoff and hoeffding bounds

Exercise 4.8: We show how to construct a random permutation π on [1, n], given a
black box that outputs numbers independently and uniformly at random from [1, k]
where k ≥ n. If we compute a function f [1, n] → [1, k] with f (i) = f ( j) for i = j,
this yields a permutation; simply output the numbers [1, n] according to the order of
the f (i) values. To construct such a function f, do the following for j = 1, . . . , n: choose
f ( j) by repeatedly obtaining numbers from the black box and setting f ( j) to the first
number found such that f ( j) = f (i) for i < j.
Prove that this approach gives a permutation chosen uniformly at random from all
permutations. Find the expected number of calls to the black box that are needed when
k = n and k = 2n. For the case k = 2n, argue that the probability that each call to the
black box assigns a value of f ( j) to some j is at least 1/2. Based on this, use a Chernoff
bound to bound the probability that the number of calls to the black box is at least 4n.

Exercise 4.9: Suppose that we can obtain independent samples X1 , X2 , . . . of a random

variable X
and that
we want to use these samples to estimate E[X]. Using t samples,
t
we use i=1 Xi /t for our estimate of E[X]. We want the estimate to be within εE[X]
from the true value of E[X] with probability at least 1 − δ. We may not be able to
use Chernoff’s bound directly to bound how good our estimate is if X is not a 0–1
random variable, and we do not know its moment generating function. We develop
an alternative
√ approach that requires only having a bound on the variance of X. Let
r = Var[X]/E[X].

(a) Show using Chebyshev’s inequality that O(r2 /ε2 δ) samples are sufficient to solve
the problem.
(b) Suppose that we need only a weak estimate that is within εE[X] of E[X] with
probability at least 3/4. Argue that O(r2 /ε2 ) samples are enough for this weak
estimate.
(c) Show that, by taking the median of O(log(1/δ)) weak estimates, we can obtain an
estimate within εE[X] of E[X] with probability at least 1 − δ. Conclude that we
need only O((r2 log(1/δ))/ε2 ) samples.

Exercise 4.10: A casino is testing a new class of simple slot machines. Each game, the
player puts in $1, and the slot machine is supposed to return either $3 to the player with
probability 4/25, $100 with probability 1/200, or nothing with all remaining probabil-
ity. Each game is supposed to be independent of other games.
The casino has been surprised to find in testing that the machines have lost $10,000
over the first million games. Derive a Chernoff bound for the probability of this event.
You may want to use a calculator or program to help you choose appropriate values as
you derive your bound.

Exercise 4.11: Consider a collection X1 ,. . . , Xn of n independent integers chosen

uniformly from the set {0, 1, 2}. Let X = ni=1 Xi and 0 < δ < 1. Derive a Chernoff
bound for Pr(X ≥ (1 + δ)n) and Pr(X ≤ (1 − δ)n).

Exercise 4.12: Consider a collection X1 , . . . , Xn

of n independent geometrically dis-
tributed random variables with mean 2. Let X = ni=1 Xi and δ > 0.
92
4.7 exercises

(a) Derive a bound on Pr(X ≥ (1 + δ)(2n)) by applying the Chernoff bound to a

sequence of (1 + δ)(2n) fair coin tosses.
(b) Directly derive a Chernoff bound on Pr(X ≥ (1 + δ)(2n)) using the moment gen-
erating function for geometric random variables.
(c) Which bound is better?

4.13: Let X1 , . . . , Xn be independent Poisson trials such that Pr(Xi = 1) = p.

Exercise
Let X = ni=1 Xi , so that E[X] = pn. Let
F (x, p) = x ln(x/p) + (1 − x) ln((1 − x)/(1 − p)).
(a) Show that, for 1 ≥ x > p,
Pr(X ≥ xn) ≤ e−nF (x,p) .
(b) Show that, when 0 < x, p < 1, we have F (x, p) − 2(x − p)2 ≥ 0. (Hint: Take the
second derivative of F (x, p) − 2(x − p)2 with respect to x.)
(c) Using parts (a) and (b), argue that
Pr(X ≥ (p + ε)n) ≤ e−2nε .
2

(d) Use symmetry to argue that

Pr(X ≤ (p − ε)n) ≤ e−2nε ,
2

and conclude that

Pr(|X − pn| ≥ εn) ≤ 2e−2nε .
2

Exercise 4.14: Modify the proof of Theorem 4.4 to show the following bound for
a weighted sum of Poisson trials. Let X1 , . . . , Xn be independent Poisson trials such
that Pr(Xi ) = pi and let a1 , . . . , an be real numbers in [0, 1]. Let X = ni=1 ai Xi and
μ = E[X]. Then the following Chernoff bound holds: for any δ > 0,
μ
eδ
Pr(X ≥ (1 + δ)μ) ≤ .
(1 + δ)(1+δ)
Prove a similar bound for the probability that X ≤ (1 − δ)μ for any 0 < δ < 1.

Exercise 4.15: Let X1 , . . . , Xn be independent random variables such that

Pr(Xi = 1 − pi ) = pi and Pr(Xi = −pi ) = 1 − pi .
n
Let X = i=1 Xi . Prove

Pr(|X| ≥ a) ≤ 2e−2a /n .
2

Hint: You may need to assume the inequality

pi eλ(1−pi ) + (1 − pi )e−λpi ≤ eλ /8 .
2

This inequality is difficult to prove directly.

93
chernoff and hoeffding bounds

Exercise 4.16: Let X1 , . . . , Xn be independent Poisson trials such that Pr(Xi = 1) =

pi . Let X = ni=1 ai Xi and μ = E[X]. Use the result of Exercise 4.15 to prove that if
|ai | ≤ 1 for all 1 ≤ i ≤ n, then for any 0 < δ < 1,
Pr(|X − μ| ≥ δμ) ≤ 2e−2δ μ2 /n
2
.

Exercise 4.17: Suppose that we have n jobs to distribute among m processors. For
simplicity, we assume that m divides n. A job takes 1 step with probability p and k > 1
steps with probability 1 − p. Use Chernoff bounds to determine upper and lower
bounds (that hold with high probability) on when all jobs will be completed if we
randomly assign exactly n/m jobs to each processor.

Exercise 4.18: In many wireless communication systems, each receiver listens on a

specific frequency. The bit b(t) sent at time t is represented by a 1 or −1. Unfortunately,
noise from other nearby communications can affect the receiver’s signal. A simplified
model of this noise is as follows. There are n other senders, and the ith has strength
pi ≤ 1. At any time t, the ith sender is also trying to send a bit bi (t ) that is represented
by 1 or −1. The receiver obtains the signal s(t ) given by

n
s(t) = b(t) + pi bi (t).
i=1
If s(t) is closer to 1 than −1, the receiver assumes that the bit sent at time t was a 1;
otherwise, the receiver assumes that it was a −1.
Assume that all the bits bi (t) can be considered independent, uniform random vari-
ables. Give a Chernoff bound to estimate the probability that the receiver makes an
error in determining b(t).

Exercise 4.19: Recall that a function f is said to be convex if, for any x1 , x2 and for
0 ≤ λ ≤ 1,
f (λx1 + (1 − λ)x2 ) ≤ λ f (x1 ) + (1 − λ) f (x2 ).
(a) Let Z be a random variable that takes on a (finite) set of values in the interval [0, 1],
and let p = E[Z]. Define the Bernoulli random variable X by Pr(X = 1) = p and
Pr(X = 0) = 1 − p. Show that E[ f (Z)] ≤ E[ f (X )] for any convex function f.
(b) Use the fact that f (x) = etx is convex for any t ≥ 0 to obtain a Chernoff bound for
the sum of n independent random variables with distribution Z as in part (a), based
on a Chernoff bound for independent Poisson trials.

Exercise 4.20: Prove Theorem 4.14.

Exercise 4.21: We prove that the Randomized Quicksort algorithm sorts a set of
n numbers in time O(n log n) with high probability. Consider the following view of
Randomized Quicksort. Every point in the algorithm where it decides on a pivot ele-
ment is called a node. Suppose the size of the set to be sorted at a particular node is s.
The node is called good if the pivot element divides the set into two parts, each of size
not exceeding 2s/3. Otherwise the node is called bad. The nodes can be thought of as
94
4.7 exercises

forming a tree in which the root node has the whole set to be sorted and its children
have the two sets formed after the first pivot step and so on.

(a) Show that the number of good nodes in any path from the root to a leaf in this tree
is not greater than c log2 n, where c is some positive constant.
(b) Show that, with high probability (greater than 1 − 1/n2 ), the number of nodes in
a given root to leaf path of the tree is not greater than c log2 n, where c is another
constant.
(c) Show that, with high probability (greater than 1 − 1/n), the number of nodes in
the longest root to leaf path is not greater than c log2 n. (Hint: How many nodes
are there in the tree?)
(d) Use your answers to show that the running time of Quicksort is O(n log n) with
probability at least 1 − 1/n.

Exercise 4.22: Consider the bit-fixing routing algorithm for routing a permutation on
the n-cube. Suppose that n is even. Write each source node s as the concatenation of
two binary strings as and bs each of length n/2. Let the destination of s’s packet be
bs and
the concatenation of√ as . Show that this permutation causes the bit-fixing routing
algorithm to take N steps.

Exercise 4.23: Consider the following modification to the bit-fixing routing algorithm
for routing a permutation on the n-cube. Suppose that, instead of fixing the bits in order
from 1 to n, each packet chooses a random order (independent of other packets’ choices)
and fixes the bits in that order. Show that there is a permutation for which this algorithm
requires 2(n) steps with high probability.

Exercise 4.24: Assume that we use the randomized routing algorithm for the n-cube
network (Algorithm 4.2) to route a total of up to p2n packets, where each node is the
source of no more than p packets and each node is the destination of no more than p
packets.

(a) Give a high-probability bound on the run-time of the algorithm.

(b) Give a high-probability bound on the maximum number of packets at any node at
any step of the execution of the routing algorithm.

Exercise 4.25: Show that the expected number of packets that traverse any edge on
the path of a given packet when routing a random permutation on the wrapped butterfly
network of N = n2n nodes is (n2 ).

Exercise 4.26: In this exercise, we design a randomized algorithm for the following
packet routing problem. We are given a network that is an undirected connected graph
G, where nodes represent processors and the edges between the nodes represent wires.
We are also given a set of N packets to route. For each packet we are given a source
node, a destination node, and the exact route (path in the graph) that the packet should
take from the source to its destination. (We may assume that there are no loops in the
95
chernoff and hoeffding bounds

path.) In each time step, at most one packet can traverse an edge. A packet can wait at
any node during any time step, and we assume unbounded queue sizes at each node.
A schedule for a set of packets specifies the timing for the movement of packets
along their respective routes. That is, it specifies which packet should move and which
should wait at each time step. Our goal is to produce a schedule for the packets that
tries to minimize the total time and the maximum queue size needed to route all the
packets to their destinations.
(a) The dilation d is the maximum distance traveled by any packet. The congestion c is
the maximum number of packets that must traverse a single edge during the entire
course of the routing. Argue that the time required for any schedule should be at
least (c + d).
(b) Consider the following unconstrained schedule, where many packets may traverse
an edge during a single time step. Assign each packet an integral delay chosen ran-
domly, independently, and uniformly from the interval [1, αc/ log(Nd)], where α
is a constant. A packet that is assigned a delay of x waits in its source node for x time
steps; then it moves on to its final destination through its specified route without
ever stopping. Give an upper bound on the probability that more than O(log(Nd))
packets use a particular edge e at a particular time step t.
(c) Again using the unconstrained schedule of part (b), show that the probability that
more than O(log(Nd)) packets pass through any edge at any time step is at most
1/(Nd) for a sufficiently large α.
(d) Use the unconstrained schedule to devise a simple randomized algorithm that, with
high probability, produces a schedule of length O(c + d log(Nd)) using queues of
size O(log(Nd)) and following the constraint that at most one packet crosses an
edge per time step.

96
chapter five
Balls, Bins, and Random Graphs

In this chapter, we focus on one of the most basic of random processes: m balls are
thrown randomly into n bins, each ball landing in a bin chosen independently and uni-
formly at random. We use the techniques we have developed previously to analyze this
process and develop a new approach based on what is known as the Poisson approx-
imation. We demonstrate several applications of this model, including a more sophis-
ticated analysis of the coupon collector’s problem and an analysis of the Bloom filter
data structure. After introducing a closely related model of random graphs, we show an
efficient algorithm for finding a Hamiltonian cycle on a random graph with sufficiently
many edges. Even though finding a Hamiltonian cycle is NP-hard in general, our result
shows that, for a randomly chosen graph, the problem is solvable in polynomial time
with high probability.

5.1. Example: The Birthday Paradox

Sitting in lecture, you notice that there are 30 people in the room. Is it more likely that
some two people in the room share the same birthday or that no two people in the room
share the same birthday?
We can model this problem by assuming that the birthday of each person is a ran-
dom day from a 365-day year, chosen independently and uniformly at random for each
person. This is obviously a simplification; for example, we assume that a person’s birth-
day is equally likely to be any day of the year, we avoid the issue of leap years, and we
ignore the possibility of twins! As a model, however, it has the virtue of being easy to
understand and analyze.
One way to calculate this probability is to directly count the configurations where
two people do not share a birthday. It is easier to think about the configurations where
people do not share a birthday than about configurations where some two people do.
Thirty days must be chosen from the 365; there are 365 30
ways to do this. These 30
days
can be assigned to the people in any of the 30! possible orders. Hence there are
365
30
30! configurations where no two people share the same birthday, out of the 36530

97
balls, bins, and random graphs

ways the birthdays could occur. Thus, the probability is

365
30
30!
30
. (5.1)
365
We can also calculate this probability by considering one person at a time. The first
person in the room has a birthday. The probability that the second person has a different
birthday is (1 − 1/365). The probability that the third person in the room then has
a birthday different from the first two, given that the first two people have different
birthdays, is (1 − 2/365). Continuing on, the probability that the kth person in the
room has a different birthday than the first k − 1, assuming that the first k − 1 have
different birthdays, is (1 − (k − 1)/365). So the probability that 30 people all have
different birthdays is the product of these terms, or

1 2 3 29
1− · 1− · 1− ··· 1 − .
365 365 365 365
You can check that this matches the expression (5.1).
Calculations reveal that (to four decimal places) this product is 0.2937, so when
30 people are in the room there is more than a 70% chance that two share the same
birthday. A similar calculation shows that only 23 people need to be in the room before
it is more likely than not that two people share a birthday.
More generally, if there are m people and n possible birthdays then the probability
that all m have different birthdays is
m−1

1 2 3 m−1 j
1− · 1− · 1− ··· 1 − = 1− .
n n n n j=1
n
Using that 1 − k/n ≈ e−k/n when k is small compared to n, we see that if m is small
compared to n then

m−1
j

m−1
1− ≈ e− j/n
j=1
n j=1
⎧ ⎫
⎨ m−1 ⎬
j
= exp −
⎩ n⎭ j=1

= e−m(m−1)/2n
≈ e−m /2n .
2

Hence the value for m at which the probability that m people all have different birthdays
is 1/2 is approximately given by the equation
m2
= ln 2,
2n
√
or m = 2n ln 2. For the case n = 365, this approximation gives m = 22.49 to two
decimal places, matching the exact calculation quite well.
98
5.2 balls into bins

Quite tight and formal bounds can be established using bounds in place of the
approximations just derived, an option that is considered in Exercise 5.3. The follow-
ing simple arguments, however, give loose bounds and good intuition. Let us consider
each person one at a time, and let Ek be the event that the kth person’s birthday does
not match any of the birthdays of the first k − 1 people. Then the probability that the
first k people fail to have distinct birthdays is

k
Pr(Ē1 ∪ Ē2 ∪ · · · ∪ Ēk ) ≤ Pr(Ēi )
i=1

k
i−1
≤
i=1
n
k(k − 1)
= .
2n
√ √
If k ≤ n this probability is less than 1/2, so with n people the probability is at
least 1/2 that all birthdays will
√bedistinct.
Now assume that the first n people all have distinct birthdays. Each person after
√ √
that has probability
√ at least n/n = 1/ n of having the √same
birthday as one of these
first n people. Hence the
√ probability that the next n people all have different
birthdays than the first n people is at most
√n
1 1 1
1− √ < < .
n e 2
√
Hence, once there are 2 n people, the probability is at most 1/e that all birthdays
will be distinct.

5.2. Balls into Bins

5.2.1. The Balls-and-Bins Model

The birthday paradox is an example of a more general mathematical framework that
is often formulated in terms of balls and bins. We have m balls that are thrown into
n bins, with the location of each ball chosen independently and uniformly at random
from the n possibilities. What does the distribution of the balls in the bins look like?
The question behind the birthday paradox is whether or not there is a bin with two
balls.
There are several interesting questions that we could ask about this random process.
For example, how many of the bins are empty? How many balls are in the fullest bin?
Many of these questions have applications to the design and analysis of algorithms.
Our analysis of the birthday paradox
√ showed that, if m balls are randomly placed
into n bins then, for some m = n , at least one of the bins is likely to have more
than one ball in it. Another interesting question concerns the maximum number of
balls in a bin, or the maximum load. Let us consider the case where m = n, so that
99
balls, bins, and random graphs

the number of balls equals the number of bins and the average load is 1. Of course the
maximum possible load is n, but it is very unlikely that all n balls land in the same bin.
We seek an upper bound that holds with probability tending to 1 as n grows large. We
can show that the maximum load is more than 3 ln n/ ln ln n with probability at most
1/n for sufficiently large n via a direct calculation and a union bound. This is a very
loose bound; although the maximum load is in fact (ln n/ ln ln n) with probability
close to 1 (as we show later), the constant factor 3 we use here is chosen to simplify
the argument and could be reduced with more care.

Lemma 5.1: When n balls are thrown independently and uniformly at random into n
bins, the probability that the maximum load is more than 3 ln n/ ln ln n is at most 1/n
for n sufficiently large.

Proof: The probability that bin 1 receives at least M balls is at most

M
n 1
.
M n

This follows from a union bound; there are Mn distinct sets of M balls, and for any set
of M balls the probability that all land in bin 1 is (1/n)M . We now use the inequalities
M M
n 1 1 e
≤ ≤ .
M n M! M
Here the second inequality is a consequence of the following general bound on facto-
rials: since
ki ∞
kk
< = ek ,
k! i=0
i!

we have
k
k
k! > .
e
Applying a union bound again allows us to find that, for M ≥ 3 ln n/ ln ln n, the prob-
ability that any bin receives at least M balls is bounded above by
M
e e ln ln n 3 ln n/ln ln n
n ≤n
M 3 ln n

ln ln n 3 ln n/ln ln n
≤n
ln n
= eln n (eln ln ln n−ln ln n )3 ln n/ln ln n
= e−2 ln n+3(ln n)(ln ln ln n)/ln ln n
1
≤
n
for n sufficiently large.

100
5.3 the poisson distribution

5.2.2. Application: Bucket Sort

Bucket sort is an example of a sorting algorithm that, under certain assumptions on
the input, breaks the (n log n) lower bound for standard comparison-based sorting.
For example, suppose that we have a set of n = 2m elements to be sorted and that each
element is an integer chosen independently and uniformly at random from the range
[0, 2k ), where k ≥ m. Using Bucket sort, we can sort the numbers in expected time
O(n). Here the expectation is over the choice of the random input, since Bucket sort is
a completely deterministic algorithm.
Bucket sort works in two stages. In the first stage, we place the elements into n
buckets. The jth bucket holds all elements whose first m binary digits correspond to
the number j. For example, if n = 210 , bucket 3 contains all elements whose first 10
binary digits are 0000000011. When j < , the elements of the jth bucket all come
before the elements in the th bucket in the sorted order. Assuming that each element
can be placed in the appropriate bucket in O(1) time, this stage requires only O(n)
time. Because of the assumption that the elements to be sorted are chosen uniformly,
the number of elements that land in a specific bucket follows a binomial distribution
B(n, 1/n). Buckets can be implemented using linked lists.
In the second stage, each bucket is sorted using any standard quadratic time algo-
rithm (such as Bubblesort or Insertion sort). Concatenating the sorted lists from each
bucket in order gives us the sorted order for the elements. It remains to show that the
expected time spent in the second stage is only O(n).
The result relies on our assumption regarding the input distribution. Under the uni-
form distribution, Bucket sort falls naturally into the balls and bins model: the elements
are balls, buckets are bins, and each ball falls uniformly at random into a bin.
Let X j be the number of elements that land in the jth bucket. The time to sort the jth
bucket is then at most c(X j )2 for some constant c. The expected time spent sorting in
the second stage is at most
⎡ ⎤
n
n
E⎣ c(X j )2 ⎦ = c E[X j2 ] = cnE[X12 ],
j=1 j=1

where the first equality follows from the linearity of expectations and the second fol-
lows from symmetry, as E[X j2 ] is the same for all buckets.
Since X1 is a binomial random variable B(n, 1/n), using the results of Section 3.2.1
yields
n(n − 1) 1
E[X12 ] = 2
+ 1 = 2 − < 2.
n n
Hence the total expected time spent in the second stage is at most 2cn, so Bucket sort
runs in expected linear time.

5.3. The Poisson Distribution

We now consider the probability that a given bin is empty in the balls and bins model
with m balls and n bins as well as the expected number of empty bins. For the first bin
101
balls, bins, and random graphs

to be empty, it must be missed by all m balls. Since each ball hits the first bin with
probability 1/n, the probability the first bin remains empty is

1 m
1− ≈ e−m/n ;
n
of course, by symmetry this probability is the same for all bins. If Xi is a random variable
that is 1 when the ith bin is empty and 0 otherwise, then E[Xi ] = (1 − 1/n)m . Let X be
a random variable that represents the number of empty bins. Then, by the linearity of
expectations,
n
n
1 m
E[X] = E Xi = E[Xi ] = n 1 − ≈ ne−m/n .
i=1 i=1
n
Thus, the expected fraction of empty bins is approximately e−m/n . This approximation
is very good even for moderately size values of m and n, and we use it frequently
throughout this chapter.
We can generalize the preceding argument to find the expected fraction of bins with
r balls for any constant r. The probability that a given bin has r balls is
r
m 1 1 m−r 1 m(m − 1) · · · (m − r + 1) 1 m−r
1− = 1 − .
r n n r! nr n
When m and n are large compared to r, the second factor on the right-hand side is
approximately (m/n)r , and the third factor is approximately e−m/n . Hence the proba-
bility pr that a given bin has r balls is approximately
e−m/n (m/n)r
pr ≈ , (5.2)
r!
and the expected number of bins with exactly r balls is approximately npr . We formalize
this relationship in Section 5.3.1.
The previous calculation naturally leads us to consider the following distribution.
Definition 5.1: A discrete Poisson random variable X with parameter μ is given by
the following probability distribution on j = 0, 1, 2, . . . :
e−μ μ j
Pr(X = j) = .
j!
(Note that Poisson random variables differ from Poisson trials, discussed in Sec-
tion 4.2.1.)
Let us verify that the definition gives a proper distribution in that the probabilities
sum to 1:
∞ ∞
e−μ μ j
Pr(X = j) =
j=0 j=0
j!
∞

−μ μj
=e
j=0
j!
= 1,

where we have used the Taylor expansion ex = ∞ j=0 (x / j!).
j

102
5.3 the poisson distribution

Next we show that the expectation of this random variable is μ:

∞

E[X] = j Pr(X = j)
j=0
∞
e−μ μ j
= j
j=1
j!
∞
e−μ μ j−1
=μ
j=1
( j − 1)!
∞
e−μ μ j
=μ
j=0
j!
= μ.
In the context of throwing m balls into n bins, the distribution of the number of balls in
a bin is approximately Poisson with μ = m/n, which is exactly the average number of
balls per bin, as one might expect.
An important property of Poisson distributions is given in the following lemma.
Lemma 5.2: The sum of a finite number of independent Poisson random variables is
a Poisson random variable.
Proof: We consider two independent Poisson random variables X and Y with means
μ1 and μ2 ; the case of more random variables is simply handled by induction. Now

j
Pr(X + Y = j) = Pr((X = k) ∩ (Y = j − k))
k=0

j
e−μ1 μk1 e−μ2 μ2
( j−k)
=
k=0
k! ( j − k)!

e−(μ1 +μ2 )
j
j!
= μk1 μ2( j−k)
j! k=0
k! ( j − k)!
−(μ1 +μ2 ) j
e j
= μk1 μ2( j−k)
j! k=0
k
e−(μ1 +μ2 ) (μ1 + μ2 ) j
= .
j!
In the last equality we used the binomial theorem to simplify the summation.

We can also prove Lemma 5.2 using moment generating functions.

Lemma 5.3: The moment generating function of a Poisson random variable with
parameter μ is
Mx (t) = eμ(e −1) .
t

103
balls, bins, and random graphs

Proof: For any t,

∞
∞

e−μ μk e−μe (μet )k
t
μ(et −1)
= eμ(e −1) .
t
E[e ] =
tX
e =e
tk

k=0
k! k=0
k!
Given two independent Poisson random variables X and Y with means μ1 and μ2 , we
apply Theorem 4.3 to prove
MX+Y (t) = MX (t) · MY (t) = e(μ1 +μ2 )(e −1) ,
t

which is the moment generating function of a Poisson random variable with mean μ1 +
μ2 . By Theorem 4.2, the moment generating function uniquely defines the distribution,
and hence the sum X + Y is a Poisson random variable with mean μ1 + μ2 .
We can also use the moment generating function of the Poisson distribution to prove
that E[X 2 ] = λ(λ + 1) and Var[X] = λ (see Exercise 5.5).
Next we develop a Chernoff bound for Poisson random variables that we will use
later in this chapter.
Theorem 5.4: Let X be a Poisson random variable with parameter μ.
1. If x > μ, then
e−μ (eμ)x
Pr(X ≥ x) ≤ ;
xx
2. If x < μ, then
e−μ (eμ)x
Pr(X ≤ x) ≤ .
xx
3. For δ > 0,
μ
eδ
Pr(X ≥ (1 + δ)μ) ≤ ;
(1 + δ)(1+δ)
4. For 0 < δ < 1,
μ
e−δ
Pr(X ≤ (1 − δ)μ) ≤ .
(1 − δ)(1−δ)
Proof: For any t > 0 and x > μ,
E[etX ]
Pr(X ≥ x) = Pr(etX ≥ etx ) ≤ .
etx
Plugging in the expression for the moment generating function of the Poisson distribu-
tion, we have
Pr(X ≥ x) ≤ eμ(e −1)−x t .
t

Choosing t = ln(x/μ) > 0 gives

Pr(X ≥ x) ≤ ex−μ−x ln(x/μ)
e−μ (eμ)x
= .
xx
104
5.3 the poisson distribution

For any t < 0 and x < μ,

E[etX ]
Pr(X ≤ x) = Pr(etX ≥ etx ) ≤ .
etx
Hence
Pr(X ≤ x) ≤ eμ(e −1)−x t .
t

Choosing t = ln(x/μ) < 0, it follows that

Pr(X ≤ x) ≤ ex−μ−x ln(x/μ)
e−μ (eμ)x
= .
xx
The alternate forms of the bound given in parts 3 and 4 follow immediately from parts
1 and 2.

5.3.1. Limit of the Binomial Distribution

We have shown that, when throwing m balls randomly into n bins, the probability pr
that a bin has r balls is approximately the Poisson distribution with mean m/n. In gen-
eral, the Poisson distribution is the limit distribution of the binomial distribution with
parameters n and p, when n is large and p is small. More precisely, we have the follow-
ing limit result.
Theorem 5.5: Let Xn be a binomial random variable with parameters n and p, where
p is a function of n and limn→∞ np = λ is a constant that is independent of n. Then,
for any fixed k,
e−λ λk
lim Pr(Xn = k) = .
n→∞ k!
This theorem directly applies to the balls-and-bins scenario. Consider the situation
where there are m balls and n bins, where m is a function of n and limm→∞ m/n = λ.
Let Xm be the number of balls in a specific bin. Then Xm is a binomial random variable
with parameters m and 1/n. Theorem 5.5 thus applies and says that
e−m/n (m/n)r
lim Pr(Xm = r) = ,
m→∞ r!
matching the approximation of Eqn. (5.2).
Before proving Theorem 5.5, we describe some of its applications. Distributions of
this type arise frequently and are often modeled by Poisson distributions. For example,
consider the number of spelling or grammatical mistakes in a book, including this book.
One model for such mistakes is that each word is likely to have an error with some very
small probability p. The number of errors is then a binomial random variable with large
n and small p that can therefore be treated as a Poisson random variable. As another
example, consider the number of chocolate chips inside a chocolate chip cookie. One
possible model is to split the volume of the cookie into a large number of small disjoint
compartments, so that a chip lands in each compartment with some probability p. With
this model, the number of chips in a cookie roughly follows a Poisson distribution.
105
balls, bins, and random graphs

We will see similar applications of the Poisson distribution in continuous settings in

Chapter 8.

Proof of Theorem 5.5: We can write

n k
Pr(Xn = k) = p (1 − p)n−k .
k
In what follows, we make use of the bound that, for |x| ≤ 1,

ex (1 − x2 ) ≤ 1 + x ≤ ex , (5.3)

which follows from the Taylor series expansion of ex . (This is left as Exercise 5.7.)
Then
nk k (1 − p)n
Pr(Xn = k) ≤ p
k! (1 − p)k
(np)k e−p n
≤
k! 1 − pk
−p n
e (np)k 1
= .
k! 1 − pk
The second line follows from the first by Eqn. (5.3) and the fact that (1 − p)k ≥ 1 − pk
for k ≥ 0. Also,
(n − k + 1)k k
Pr(Xn = k) ≥ p (1 − p)n
k!
((n − k + 1)p)k −p n
≥ e (1 − p2 )n
k!
e−p n ((n − k + 1)p)k
≥ (1 − p2 n),
k!
where in the second inequality we applied Eqn. (5.3) with x = −p.
Combining, we have
e−p n (np)k 1 e−p n ((n − k + 1)p)k
≥ Pr(Xn = k) ≥ (1 − p2 n).
k! 1 − pk k!
In the limit, as n approaches infinity, p approaches zero because the limiting value of
pn is the constant λ. Hence 1/(1 − pk) approaches 1, 1 − p2 n approaches 1, and the
difference between (n − k + 1)p and np approaches 0. It follows that
e−p n (np)k 1 e−λ λk
lim =
n→∞ k! 1 − pk k!
and
e−p n ((n − k + 1)p)k e−λ λk
lim (1 − p2 n) = .
n→∞ k! k!
Since limn→∞ Pr(Xn = k) lies between these two values, the theorem follows.

106
5.4 the poisson approximation

5.4. The Poisson Approximation

The main difficulty in analyzing balls-and-bins problems is handling the dependencies

that naturally arise in such systems. For example, if we throw m balls into n bins and find
that bin 1 is empty, then it is less likely that bin 2 is empty because we know that the m
balls must now be distributed among n − 1 bins. More concretely: if we know the num-
ber of balls in the first n − 1 bins, then the number of balls in the last bin is completely
determined. The loads of the various bins are not independent, and independent random
variables are generally much easier to analyze, since we can apply Chernoff bounds. It
is therefore useful to have a general way to circumvent these sorts of dependencies.
We have already shown that, after throwing m balls independently and uniformly at
random into n bins, the distribution of the number of balls in a given bin is approxi-
mately Poisson with mean m/n. We would like to say that the joint distribution of the
number of balls in all the bins is well approximated by assuming the load at each bin is
an independent Poisson random variable with mean m/n. This would allow us to treat
bin loads as independent random variables. We show here that we can do this when
we are concerned with sufficiently rare events. Specifically, we show in Corollary 5.9
that taking the probability of an event using this Poisson approximation for all of the
√
bins and multiplying it by e m gives an upper bound for the probability of the event
√
when m balls are thrown into n bins. For rare events, this extra e m factor will not be
significant. To achieve this result, we now introduce some technical machinery.
Suppose that m balls are thrown into n bins independently and uniformly at random,
and let Xi(m) be the number of balls in the ith bin, where 1 ≤ i ≤ n. Let Y1(m) , . . . , Yn(m) be
independent Poisson random variables with mean m/n. We derive a useful relationship
between these two sets of random variables. Tighter bounds for specific problems can
often be obtained with more detailed analysis, but this approach is quite general and
easy to apply.
The difference between throwing m balls randomly and assigning each bin a number
of balls that is Poisson distributed with mean m/n is that, in the first case, we know there
are m balls in total, whereas in the second case we know only that m is the expected
number of balls in all of the bins. But suppose when we use the Poisson distribution
we end up with m balls. In this case, we do indeed have that the distribution is the same
as if we threw m balls into n bins randomly.

Theorem 5.6: The distribution of (Y1(m) , . . . , Yn(m) ) conditioned on i Yi(m) = k is the
same as (X1(k) , . . . , Xn(k) ), regardless of the value of m.
(k)
the probability that (X1 , . . . , Xn ) =
(k)
Proof: When throwing k balls into n bins,
(k1 , . . . , kn ) for any k1 , . . . , kn satisfying i ki = k is given by

k
k1 ;k2 ;...;kn k!
= .
n k (k1 !)(k2 !) · · · (kn !)nk

Now, for any k1 , . . . , kn with i ki = k, consider the probability that
(m)
Y1 , . . . , Yn(m) = (k1 , . . . , kn )
107
balls, bins, and random graphs

conditioned on (Y1(m) , . . . , Yn(m) ) satisfying i Yi(m) = k:

(m) n
Pr Y1 , . . . , Yn (m)
= (k1 , . . . , kn ) (m)
Yi = k
i=1

Pr Y1(m) = k1 ∩ Y1(m) = k2 ∩ · · · ∩ Yn(m) = kn
= n (m) .
Pr i=1 Yi =k

The probability that Yi(m) = ki is e−m/n (m/n)ki /ki !, since the Yi(m) are independent Pois-
son random variables with mean m/n. Also, by Lemma 5.2, the sum of the Yi(m) is itself
a Poisson random variable with mean m. Hence
n −m/n
Pr Y1(m) = k1 ∩ Y1(m) = k2 ∩ · · · ∩ Yn(m) = kn i=1 e (m/n)ki /ki !
n =
Pr i=1 Yi
(m)
=k e−m mk /k!
k!
= ,
(k1 !)(k2 !) · · · (kn !)nk
proving the theorem.
With this relationship between the two distributions, we can prove strong results about
any function on the loads of the bins.
Theorem 5.7: Let f (x1 , . . . , xn ) be a nonnegative function. Then
√
E f X1(m) , . . . , Xn(m) ≤ e mE f Y1(m) , . . . , Yn(m) . (5.4)
Proof: We have that
∞
n
(m)
n (m)
E f Y1 , . . . , Yn(m) = E f Y1 , . . . , Yn(m)
(m) (m)
Yi = k Pr Yi = k
k=0 i=1 i=1
n
(m)
n
≥ E f Y1 , . . . , Yn(m) Yi(m) = m Pr Yi(m) = m
i=1 i=1

= E[ f X1(m) , . . . , Xn(m) ] Pr Yi(m) =m ,
(m)
where
n the last equality follows from the fact that the joint distribution of the Y i given
(m) (m) n (m)
Y
i=1 i = m is exactly that of the Xi , as shown in Theorem 5.6. Since i=1 Yi
is Poisson distributed with mean m, we now have
mm e−m
E f Y1(m) , . . . , Yn(m) ≥ E f X1(m) , . . . , Xn(m) .
m!
We use the following loose bound on m!, which we prove as Lemma 5.8:
m
√ m
m! < e m .
e
This yields
1
E f Y1(m) , . . . , Yn(m) ≥ E f X1(m) , . . . , Xn(m) √ ,
e m
and the theorem is proven.

108
5.4 the poisson approximation

We prove the upper bound we used for factorials, which closely matches the loose lower
bound we used in Lemma 5.1.

Lemma 5.8:

√ n n
n! ≤ e n . (5.5)
e
Proof: We use the fact that

n
ln(n!) = ln i.
i=1

We first claim that, for i ≥ 2,

i
ln(i − 1) + ln i
ln x dx ≥ .
i−1 2

This follows from the fact that ln x is concave, since its second derivative is −1/x2 ,
which is always negative. Therefore,
n
n
ln n
ln x dx ≥ ln i −
1 i=1
2

or, equivalently,
ln n
n ln n − n + 1 ≥ ln(n!) − .
2
The result now follows simply by exponentiating.

Theorem 5.7 holds for any nonnegative function on the number of balls in the bins. In
particular, if the function is the indicator function that is 1 if some event occurs and 0
otherwise, then the theorem gives bounds on the probability of events. Let us call the
scenario in which the number of balls in the bins are taken to be independent Poisson
random variables with mean m/n the Poisson case, and the scenario where m balls are
thrown into n bins independently and uniformly at random the exact case.

Corollary 5.9: Any event that takes place with probability p in the Poisson case takes
√
place with probability at most pe m in the exact case.

Proof: Let f be the indicator function of the event. In this case, E[ f ] is just the
probability that the event occurs, and the result follows immediately from Theorem
5.7.

This is a quite powerful result. It says that any event that happens with small proba-
bility in the Poisson case also happens with small probability in the exact case, where
balls are thrown into bins. Since in the analysis of algorithms we often want to show
that certain events happen with small probability, this result says that we can utilize an
109
balls, bins, and random graphs

analysis of the Poisson approximation to obtain a bound for the exact case. The Pois-
son approximation is easier to analyze because the numbers of balls in each bin are
independent random variables.1
We can actually do even a little bit better in many natural cases. Part of the proof of
the following theorem is outlined in Exercises 5.14 and 5.15.
Theorem 5.10: Let f (x1 , . . . , xn ) be a nonnegative function such that
E[ f (X1(m) , . . . , Xn(m) )] is either monotonically increasing or monotonically decreasing
in m. Then

E f X1(m) , . . . , Xn(m) ≤ 2E f Y1(m) , . . . , Yn(m) . (5.6)
The following corollary is immediate.
Corollary 5.11: Let E be an event whose probability is either monotonically increas-
ing or monotonically decreasing in the number of balls. If E has probability p in the
Poisson case, then E has probability at most 2p in the exact case.
To demonstrate the utility of this corollary, we again consider the maximum load prob-
lem for the case m = n. We have shown via a union bound argument that the maximum
load is at most 3 ln n/ ln ln n with high probability. Using the Poisson approximation,
we prove the following almost-matching lower bound on the maximum load.
Lemma 5.12: When n balls are thrown independently and uniformly at random into
n bins, the maximum load is at least ln n/ ln ln n with probability at least 1 − 1/n for n
sufficiently large.
Proof: In the Poisson case, the probability that bin 1 has load at least M = ln n/ ln ln n
is at least 1/eM!, which is the probability it has load exactly M. In the Poisson case,
all bins are independent, so the probability that no bin has load at least M is at
most

1 n
1− ≤ e−n/(eM!) .
eM!
We now need to choose M so that e−n/(eM!) ≤ n−2 , for then (by Theorem 5.7) we will
have that the probability that the maximum load is not at least M in the exact case is at
√
most e n/n2 < 1/n. This will give the lemma. Because the maximum load is clearly
monotonically increasing in the number of balls, we could also apply the slightly better
Theorem 5.10, but this would not affect the argument substantially.
It therefore suffices to show that M! ≤ n/2e ln n, or equivalently that ln M! ≤ ln n −
ln ln n − ln(2e). From our bound of Eqn. (5.5), it follows that
M M
√ M M
M! ≤ e M ≤M
e e

1 There are other ways to handle the dependencies in the balls-and-bins model. In Chapter 13 we describe a more
general way to deal with dependencies (using martingales) that applies here. Also, there is a theory of negative
dependence that applies to balls-and-bins problems that also allows these dependencies to be dealt with nicely.

110
5.4 the poisson approximation

when n (and hence M = ln n/ ln ln n) are suitably large. Hence, for n suitably large,
ln M! ≤ M ln M − M + ln M
ln n ln n
= (ln ln n − ln ln ln n) − + (ln ln n − ln ln ln n)
ln ln n ln ln n
ln n
≤ ln n −
ln ln n
≤ ln n − ln ln n − ln(2e),
where in the last two inequalities we have used the fact that ln ln n = o(ln n/ ln ln n).

5.4.1.∗ Example: Coupon Collector’s Problem, Revisited

The coupon collector’s problem introduced in Section 2.4.1 can be thought of as a balls-
and-bins problem. Recall that in this problem there are n different types of coupons,
each cereal box yields a coupon chosen independently and uniformly at random from
the n types, and you need to buy cereal boxes until you collect one of each coupon.
If we think of coupons as bins and cereal boxes as balls, the question becomes: If
balls are thrown independently and uniformly at random into bins, how many balls are
thrown until all bins have at least one ball? We showed in Section 2.4.1 that the expected
number of cereal boxes necessary is nH(n) ≈ n ln n; in Section 3.3.1 we showed that, if
there are n ln n + cn cereal boxes, then the probability that not all coupons are collected
is at most e−c . These results translate immediately to the balls-and-bins setting. The
expected number of balls that must be thrown before each bin has at least one ball is
nH(n), and when n ln n + cn balls are thrown the probability that not all bins have at
least one ball is e−c .
We have seen in Chapter 4 that Chernoff bounds yield concentration results for sums
of independent 0–1 random variables. We will use here a Chernoff bound for the Pois-
son distribution to obtain much stronger results for the coupon collector’s problem.
Theorem 5.13: Let X be the number of coupons observed before obtaining one of each
of n types of coupons. Then, for any constant c,
−c
lim Pr[X > n ln n + cn] = 1 − e−e .
n→∞
This theorem states that, for large n, the number of coupons required should be very
close to n ln n. For example, over 98% of the time the number of coupons required lies
between n ln n − 4n and n ln n + 4n. This is an example of a sharp threshold, where
the random variable is closely concentrated around its mean.
Proof: We look at the problem as a balls-and-bins problem. We begin by considering
the Poisson approximation, and then demonstrate that the Poisson approximation gives
the correct answer in the limit. For the Poisson approximation, we suppose that the
number of balls in each bin is a Poisson random variable with mean ln n + c, so that
the expected total number of balls is m = n ln n + cn. The probability that a specific
bin is empty is then
e−c
e−(ln n+c) = .
n
111
balls, bins, and random graphs

Since all bins are independent under the Poisson approximation, the probability that no
bin is empty is

e−c n −c
1− ≈ e−e .
n
The last approximation is appropriate in the limit as n grows large, so we apply it here.
To show the Poisson approximation is accurate, we undertake the following steps.
Consider the experiment where each bin has a Poisson number of balls, each with mean
ln n + c. Let E be the event that no bin is empty, and let X be the number of balls thrown.
We have seen that
−c
lim Pr(E ) = e−e .
n→∞

We use Pr(E ) by splitting it as follows:

√ √
Pr(E ) = Pr E ∩ |X − m| ≤ 2m ln m + Pr E ∩ |X − m| > 2m ln m
√ √
= Pr E | |X − m| ≤ 2m ln m · Pr |X − m| ≤ 2m ln m
√ √
+ Pr E | |X − m| > 2m ln m · Pr |X − m| > 2m ln m . (5.7)
This
representation
√ proves helpful once we establish two facts. First, we show that
Pr |X − m| > 2m ln m is o(1); that is, the probability that in the Poisson case the
number of balls thrown deviates significantly from its mean m is o(1). This guarantees
that the second term in the summation on the right of Eqn. (5.7) is o(1). Second, we
show that
√
Pr E | |X − m| ≤ 2m ln m − Pr(E | X = m) = o(1).

But from Theorem 5.6, the quantity on the right is equal to the probability that every
bin has at least one ball when m balls are thrown randomly, since conditioning on m
total balls with the Poisson approximation is equivalent to throwing m balls randomly
into the n bins. As aresult, the theorem
√ follows
once we have shown these two facts.
To show that Pr |X − m| > 2m ln m is o(1), consider that X is a Poisson ran-
dom variable with mean m, since it is a sum of independent Poisson random variables.
We use the Chernoff bound for the Poisson distribution (Theorem 5.4) to bound this
112
5.5 application: hashing

probability, writing the bound as

Pr(X ≥ x) ≤ ex−m−x ln(x/m) .
√
For x = m + 2m ln m, we use that ln(1 + z) ≥ z − z2 /2 for z ≥ 0 to show
√ √ √ √
Pr X > m + 2m ln m ≤ e 2m ln m−(m+ 2m ln m) ln(1+ 2 ln m/m)
√ √ √
≤e 2m ln m−(m+ 2m ln m)( 2 ln m/m−ln m/m)
√
− ln m+ 2m ln m(ln m/m)
=e = o(1).
√
A similar argument holds if x < m, so Pr |X − m| > 2m ln m = o(1).
We now show the second fact, that
√
Pr E | |X − m| ≤ 2m ln m − Pr(E | X = m) = o(1).
Note that Pr(E | X = k) is increasing in k, since this probability corresponds to the
probability that all bins are nonempty when k balls are thrown independently and uni-
formly at random. The more balls that are thrown, the more likely all bins are nonempty.
It follows that
√ √
Pr E | X = m − 2m ln m ≤ Pr E | |X − m| ≤ 2m ln m
√
≤ Pr E | X = m + 2m ln m .
Hence we have the bound
√
Pr(E | |X − m| ≤ 2m ln m − Pr(E | X = m)
√ √
≤ Pr E | X = m + 2m ln m − Pr E | X = m − 2m ln m ,
and we show the right-hand side is o(1). This is the √ difference between the proba-
bility that all
√ bins receive at least one ball when m − 2m ln m balls are thrown and
when m + 2m ln m balls are thrown. This√difference is equivalent to the probability of
the following experiment: we throw m − 2m√ln m balls and there is still at least one
empty bin, but after throwing an additional 2 2m ln m balls, all bins are nonempty.
√
In order for this to happen, there must be at√least one empty bin after m − 2m ln m
balls;
√ the probability
that one of the next 2 2m ln m balls covers this bin is at most
2 2m ln m /n = o(1) by the union bound. Hence this difference is o(1) as well.

5.5. Application: Hashing

5.5.1. Chain Hashing

The balls-and-bins-model is also useful for modeling hashing. For example, consider
the application of a password checker, which prevents people from using common,
easily cracked passwords by keeping a dictionary of unacceptable passwords. When
a user tries to set up a password, the application would like to check if the requested
password is part of the unacceptable set. One possible approach for a password checker
would be to store the unacceptable passwords alphabetically and do a binary search on
the dictionary to check if a proposed password is unacceptable. A binary search would
require (log m) time for m words.
113
balls, bins, and random graphs

Another possibility is to place the words into bins and then search the appropriate bin
for the word. The words in a bin would be represented by a linked list. The placement
of words into bins is accomplished by using a hash function. A hash function f from a
universe U into a range [0, n − 1] can be thought of as a way of placing items from the
universe into n bins. Here the universe U would consist of possible password strings.
The collection of bins is called a hash table. This approach to hashing is called chain
hashing, since items that fall in the same bin are chained together in a linked list.
Using a hash table turns the dictionary problem into a balls-and-bins problem. If our
dictionary of unacceptable passwords consists of m words and the range of the hash
function is [0, n − 1], then we can model the distribution of words in bins with the
same distribution as m balls placed randomly in n bins. We are making a rather strong
assumption by presuming that our hash function maps words into bins in a fashion
that appears random, so that the location of each word is independent and identically
distributed. There is a great deal of theory behind designing hash functions that appear
random, and we will not delve into that theory here. We simply model the problem by
assuming that hash functions are random. In other words, we assume that (a) for each
x ∈ U, the probability that f (x) = j is 1/n (for 0 ≤ j ≤ n − 1) and that (b) the values
of f (x) for each x are independent of each other. Notice that this does not mean that
every evaluation of f (x) yields a different random answer! The value of f (x) is fixed
for all time; it is just equally likely to take on any value in the range.
Let us consider the search time when there are n bins and m words. To search for an
item, we first hash it to find the bin that it lies in and then search sequentially through
the linked list for it. If we search for a word that is not in our dictionary, the expected
number of words in the bin the word hashes to is m/n. If we search for a word that is in
our dictionary, the expected number of other words in that word’s bin is (m − 1)/n, so
the expected number of words in the bin is 1 + (m − 1)/n. If we choose n = m bins for
our hash table, then the expected number of words we must search through in a bin is
constant. If the hashing takes constant time, then the total expected time for the search
is constant.
The maximum time to search for a word, however, is proportional to the maximum
number of words in a bin. We have shown that when n = m this maximum load is
(ln n/ ln ln n) with probability close to 1, and hence with high probability this is the
maximum search time in such a hash table. While this is still faster than the required
time for standard binary search, it is much slower than the average, which can be a
drawback for many applications.
Another drawback of chain hashing can be wasted space. If we use n bins for n items,
several of the bins will be empty, potentially leading to wasted space. The space wasted
can be traded off against the search time by making the average number of words per
bin larger than 1.

5.5.2. Hashing: Bit Strings

If we want to save space instead of time, we can use hashing in another way. Again, we
consider the problem of keeping a dictionary of unsuitable passwords. Assume that a
password is restricted to be eight ASCII characters, which requires 64 bits (8 bytes) to
114
5.5 application: hashing

represent. Suppose we use a hash function to map each word into a 32-bit string. This
string will serve as a short fingerprint for the word; just as a fingerprint is a succinct way
of identifying people, the fingerprint string is a succinct way of identifying a word. We
keep the fingerprints in a sorted list. To check if a proposed password is unacceptable,
we calculate its fingerprint and look for it on the list, say by a binary search.2 If the
fingerprint is on the list, we declare the password unacceptable.
In this case, our password checker may not give the correct answer! It is possible for
a user to input an acceptable password, only to have it rejected because its fingerprint
matches the fingerprint of an unacceptable password. Hence there is some chance that
hashing will yield a false positive: it may falsely declare a match when there is not an
actual match. The problem is that – unlike fingerprints for human beings – our finger-
prints do not uniquely identify the associated word. This is the only type of mistake this
algorithm can make; it does not allow a password that is in the dictionary of unsuitable
passwords. In the password application, allowing false positives means our algorithm
is overly conservative, which is probably acceptable. Letting easily cracked passwords
through, however, would probably not be acceptable.
To place the problem in a more general context, we describe it as an approximate
set membership problem. Suppose we have a set S = {s1 , s2 , . . . , sm } of m elements
from a large universe U. We would like to represent the elements in such a way that
we can quickly answer queries of the form “is x an element of S?” We would also like
the representation to take as little space as possible. In order to save space, we would
be willing to allow occasional mistakes in the form of false positives. Here the unal-
lowable passwords correspond to our set S.
How large should the range of the hash function used to create the fingerprints be?
Specifically, if we are working with bits, how many bits should we use to create a
fingerprint? Obviously, we want to choose the number of bits that gives an acceptable
probability for a false positive match. The probability that an acceptable password has a
fingerprint that is different from any specific unallowable password in S is (1 − 1/2b ).
It follows that if the set S has size m and if we use b bits for the fingerprint, then
the probability of a false positive for an acceptable password is 1 − (1 − 1/2b )m ≥
1 − e−m/2 . If we want this probability of a false positive to be less than a constant c,
b

we need
e−m/2 ≥ 1 − c,
b

which implies that

m
b ≥ log2 .
ln(1/(1 − c))
That is, we need b = (log2 m) bits. On the other hand, if we use b = 2 log2 m bits,
then the probability of a false positive falls to

1 m 1
1− 1− 2 < .
m m
2 In this case the fingerprints will be uniformly distributed over all 32-bit strings. There are faster algorithms
for searching over sets of numbers with this distribution, just as Bucket sort allows faster sorting than standard
comparison-based sorting when the elements to be sorted are from a uniform distribution, but we will not concern
ourselves with this point here.

115
balls, bins, and random graphs

In our example, if our dictionary has 216 = 65,536 words, then using 32 bits when
hashing yields a false positive probability of just less than 1/65,536.

5.5.3. Bloom Filters

We can generalize the hashing ideas of Sections 5.5.1 and 5.5.2 to achieve more inter-
esting trade-offs between the space required and the false positive probability. The
resulting data structure for the approximate set membership problem is called a Bloom
filter.
A Bloom filter consists of an array of n bits, A[0] to A[n − 1], initially all set
to 0. A Bloom filter uses k independent random hash functions h1 , . . . , hk with range
{0, . . . , n − 1}. We make the usual assumption for analysis that these hash functions
map each element in the universe to a random number independently and uniformly
over the range {0, . . . , n − 1}. Suppose that we use a Bloom filter to represent a set
S = {s1 , s2 , . . . , sm } of m elements from a large universe U. For each element s ∈ S,
the bits A[hi (s)] are set to 1 for 1 ≤ i ≤ k. A bit location can be set to 1 multiple times,
but only the first change has an effect. To check if an element x is in S, we check whether
all array locations A[hi (x)] for 1 ≤ i ≤ k are set to 1. If not, then clearly x is not a mem-
ber of S, because if x were in S then all locations A[hi (x)] for 1 ≤ i ≤ k would be set
to 1 by construction. If all A[hi (x)] are set to 1, we assume that x is in S, although
we could be wrong. We would be wrong if all of the positions A[hi (x)] were set to 1
by elements of S even though x is not in the set. Hence Bloom filters may yield false
positives. Figure 5.1 shows an example.
The probability of a false positive for an element not in the set – the false positive
probability – can be calculated in a straightforward fashion, given our assumption that
the hash functions are random. After all the elements of S are hashed into the Bloom
filter, the probability that a specific bit is still 0 is

1 km
1− ≈ e−km/n .
n
We let p = e−km/n . To simplify the analysis, let us temporarily assume that a fraction
p of the entries are still 0 after all of the elements of S are hashed into the Bloom
filter.
The probability of a false positive is then
⎛ km ⎞k
⎝1 − 1 − 1 ⎠ ≈ (1 − e−km/n )k = (1 − p)k .
n

We let f = (1 − e−km/n )k = (1 − p)k . From now on, for convenience we use the
asymptotic approximations p and f to represent (respectively) the probability that a
bit in the Bloom filter is 0 and the probability of a false positive.
Suppose that we are given m and n and wish to optimize the number of hash func-
tions k in order to minimize the false positive probability f. There are two competing
forces: using more hash functions gives us more chances to find a 0-bit for an element
that is not a member of S, but using fewer hash functions increases the fraction of 0-bits
116
5.5 application: hashing

Figure 5.1: Example of how a Bloom filter functions.

in the array. The optimal number of hash functions that minimizes f as a function of k
is easily found taking the derivative. Let g = k ln(1 − e−km/n ), so that f = eg and mini-
mizing the false positive probability f is equivalent to minimizing g with respect to k. We
find
dg km e−km/n
= ln(1 − e−km/n ) + .
dk n 1 − e−km/n
It is easy to check that the derivative is zero when k = (ln 2) · (n/m) and that this
point is a global minimum. In this case the false positive probability f is (1/2)k ≈
(0.6185)n/m . The false positive probability falls exponentially in n/m, the number of
bits used per item. In practice, of course, k must be an integer, so the best possible
choice of k may lead to a slightly higher false positive rate.
A Bloom filter is like a hash table, but instead of storing set items we simply use one
bit to keep track of whether or not an item hashed to that location. If k = 1, we have
just one hash function and the Bloom filter is equivalent to a hashing-based fingerprint
system, where the list of the fingerprints is stored in a 0–1 bit array. Thus Bloom filters
can be seen as a generalization of the idea of hashing-based fingerprints. As we saw
when using fingerprints, to get even a small constant probability of a false positive
required (log m) fingerprint bits per item. In many practical applications, (log m)
bits per item can be too many. Bloom filters allow a constant probability of a false
positive while keeping n/m, the number of bits of storage required per item, constant.
For many applications, the small space requirements make a constant probability of
117
balls, bins, and random graphs

error acceptable. For example, in the password application, we may be willing to accept
false positive rates of 1% or 2%.
Bloom filters are highly effective even if n = cm for a small constant c, such as
c = 8. In this case, when k = 5 or k = 6 the false positive probability is just over 0.02.
This contrasts with the approach of hashing each element into (log m) bits. Bloom
filters require significantly fewer bits while still achieving a very good false positive
probability.
It is also interesting to frame the optimization another way. Consider f, the proba-
bility of a false positive, as a function of p. We find
f = (1 − p)k
= (1 − p)(− ln p)(n/m)
= (e− ln(p) ln(1−p) )n/m . (5.8)
From the symmetry of this expression, it is easy to check that p = 1/2 minimizes the
false positive probability f. Hence the optimal results are achieved when each bit of the
Bloom filter is 0 with probability 1/2. An optimized Bloom filter looks like a random
bit string.
To conclude, we reconsider our assumption that the fraction of entries that are still 0
after all of the elements of S are hashed into the Bloom filter is p. Each bit in the array
can be thought of as a bin, and hashing an item is like throwing a ball. The fraction of
entries that are still 0 after all of the elements of S are hashed is therefore equivalent to
the fraction of empty bins after mk balls are thrown into n bins. Let X be the number of
such bins when mk balls are thrown. The expected fraction of such bins is

1 km
p = 1− .
n
The events of different bins being empty are not independent, but we can apply
Corollary 5.9, along with the Chernoff bound of Eqn. (4.6), to obtain
√
Pr(|X − np | ≥ εn) ≤ 2e ne−nε /3p .
2

Actually, Corollary 5.11 applies as well, since the number of 0-entries – which corre-
sponds to the number of empty bins – is monotonically decreasing in the number of
balls thrown. The bound tells us that the fraction of empty bins is close to p (when
n is reasonably large) and that p is very close to p. Our assumption that the fraction
of 0-entries in the Bloom filter is p is therefore quite accurate for predicting actual
performance.

5.5.4. Breaking Symmetry

As our last application of hashing, we consider how hashing provides a simple way
to break symmetry. Suppose that n users want to utilize a resource, such as time on a
supercomputer. They must use the resource sequentially, one at a time. Of course, each
user wants to be scheduled as early as possible. How can we decide a permutation of
the users quickly and fairly?
118
5.6 random graphs

If each user has an identifying name or number, hashing provides one possible solu-
tion. Hash each user’s identifier into 2b bits, and then take the permutation given by
the sorted order of the resulting numbers. That is, the user whose identifier gives the
smallest number when hashed comes first, and so on. For this approach to work, we do
not want two users to hash to the same value, since then we must decide again how to
order these users.
If b is sufficiently large, then with high probability the users will all obtain distinct
hash values. One can analyze the probability that two hash values collide by using the
analysis from Section 5.1 for the birthday paradox; hash values correspond to birthdays.

We here use a simpler analysis based just on using a union bound. There are n2 pairs
of users. The probability that any specific pair has the same hash value is 1/2b . Hence
the probability that any pair has the same hash value is at most

n 1 n(n − 1)
b
= .
2 2 2b+1

Choosing b = 3 log2 n − 1 guarantees success with probability at least 1 − 1/n.

This solution is extremely flexible, making it useful for many situations in dis-
tributed computing. For example, new users can easily be added into the schedule
at any time, as long as they do not hash to the same number as another scheduled
user.
A related problem is leader election. Suppose that instead of trying to order all of
the users, we simply want to fairly choose a leader from them. Again, if we have a
suitably random hash function then we can simply take the user whose hash value is
the smallest. An analysis of this scheme is left as Exercise 5.26.

5.6. Random Graphs

5.6.1. Random Graph Models

There are many NP-hard computational problems defined on graphs: Hamiltonian
cycle, independent set, vertex cover, and so forth. One question worth asking is whether
these problems are hard for most inputs or just for a relatively small fraction of all
graphs. Random graph models provide a probabilistic setting for studying such ques-
tions.
Most of the work on random graphs has focused on two closely related models, Gn,p
and Gn,N . In Gn,p we consider all undirected graphs on n distinct vertices v1 , v2 , . . . , vn .
A graph with a given set of m edges has probability

pm (1 − p)(2 )−m .
n

One way to generate a random graph in Gn,p is to consider each of the n2 possible edges
in some order and then independently add each edge to the graph with probability p.
119
balls, bins, and random graphs

The expected number of edges in the graph is therefore n2 p, and each vertex has
expected degree (n − 1)p.
In the Gn,N model,
n we consider all undirected graphs on n vertices with exactly N
edges. There are (N2 ) possible graphs, each selected with equal probability. One way
from the graphs in Gn,N is to start with a graph with no
to generate a graph uniformly
edges. Choose one of the n2 possible edges uniformly n at random and add it to the edges
in the graph. Now choose one of the remaining 2 − 1 possible edges independently
and uniformly at random and add it to the graph. Similarly, continue choosing one of
the remaining unchosen edges independently and uniformly at random until there are
N edges.
The Gn,p and Gn,N models are related; when p = N/ n2 , the number of edges in a
random graph in Gn,p is concentrated around N, and conditioned on a graph from Gn,p
having N edges, that graph is uniform over all the graphs from Gn,N . The relationship
is similar to the relationship between throwing m balls into n bins and having each bin
have a Poisson distributed number of balls with mean m/n.
Here, for example is one way of formalizing the relationship between the Gn,p and
Gn,N models. A graph property is a property that holds for a graph regardless of how
the vertices are labeled, so it holds for all possible isomorphisms of the graph. We say
that a graph property is monotone increasing if whenever the property holds for G =
(V, E ) it holds also for any graph G = (V, E ) with E ⊆ E ; monotone decreasing graph
properties are defined similarly. For example, the property that a graph is connected
is a monotone increasing graph property, as is the property that a graph contains a
connected component of at least k vertices for any particular value of k. The property
that a graph is a tree, however, is not a monotone graph property, although the property
that the graph contains no cycles is a monotone decreasing graph property. We have
the following lemma:

Lemma 5.14: For a given monotone increasing graph property let P(n, N) be the
probability that the property holds for a graph in Gn,N and P(n, p) the probability
that
it holds for a graph in Gn,p . Let p+ = (1 + )N/ n2 and p− = (1 − )N/ n2 for a
constant 1 > > 0. Then
P(n, p− ) − e−O(N ) ≤ P(n, N) ≤ P(n, p+ ) + e−O(N ) .
Proof: Let X be a random variable giving the number of edges that occur when a graph
is chosen from Gn,p− . Conditioned on X = k, a random graph from Gn,p− is equivalent
to a graph from Gn,k , since the k edges chosen are equally likely to be any subset of k
edges. Hence
( n2 )

−
P(n, p ) = P(n, k) Pr(X = k).
k=0

In particular,

P(n, p− ) = P(n, k) Pr(X = k) + P(n, k) Pr(X = k).
k≤N k>N

120
5.6 random graphs

Also, for a monotone increasing graph property, P(n, k) ≤ P(n, N) for k ≤ N. Hence
P(n, p− ) ≤ Pr(X ≤ N)P(n, N) + Pr(X > N) ≤ P(n, N) + Pr(X > N).

However, Pr(X > N) can be bounded by a standard Chernoff bound; X is the sum of
n
2
independent Bernoulli random variables, and hence by Theorem 4.4

1
E[X] ≤ Pr(X > (1 + )E[X]) ≤ e−(1−) N/3 .
2
Pr(X > N) = Pr X >
1−
Here we have used that 1
1−
> 1 + for 0 < < 1.
Similarly,

P(n, p+ ) = P(n, k) Pr(X = k) + P(n, k) Pr(X = k),
k<N k≥N

so
P(n, p+ ) ≥ Pr(X ≥ N)P(n, N) ≥ P(n, N) − Pr(X < N).
By Theorem 4.5

1
E[X] ≤ e−(1+) N/8 ,
2
Pr(X > N) = Pr X < E[X] ≤ Pr X < 1 +
1+ 2
where here we have used that 1
1+
< 1 − /2 for 0 < < 1.

A similar result holds for monotone decreasing properties. Another formalization of

the relationship between the graph models is given as Exercise 5.18.
There are many additional similarities between random graphs and the balls-and-
bins model. Throwing edges into the graph as in the Gn,N model is like throwing balls
into bins. However, since each edge has two endpoints, each edge is like throwing
two balls at once into two different bins. The pairing defined by the edges adds a rich
structure that does not exist in the balls-and-bins model. Yet we can often utilize the
relation between the two models to simplify analysis in random graph models. For
example, in the coupon collector’s problem we found that when we throw n ln n + cn
−c
balls, the probability that there are no empty bins converges to e−e as n grows to
infinity. Similarly, we have the following theorem for random graphs, which is left as
Exercise 5.20.
Theorem 5.15: Let N = 12 (n ln n + cn). Then the probability that there are no isolated
−c
vertices (vertices with degree 0) in Gn,N converges to e−e as n grows to infinity.

5.6.2. Application: Hamiltonian Cycles in Random Graphs

A Hamiltonian path in a graph is a path that traverses each vertex exactly once. A
Hamiltonian cycle is a cycle that traverses each vertex exactly once. We show an inter-
esting connection between random graphs and balls-and-bins problems by analyzing a
simple and efficient algorithm for finding Hamiltonian cycles in random graphs. The
algorithm is randomized, and its probabilistic analysis is over both the input distribu-
tion and the random choices of the algorithm. Finding a Hamiltonian cycle in a graph
121
balls, bins, and random graphs

Figure 5.2: The rotation of the path v1 , v2 , v3 , v4 , v5 , v6 with the edge (v6 , v3 ) yields a new path
v1 , v2 , v3 , v6 , v5 , v4 .

is an NP-hard problem. However, our analysis of this algorithm shows that finding a
Hamiltonian cycle is not hard for suitably randomly selected graphs, even though it
may be hard to solve in general.
Our algorithm will make use of a simple operation called a rotation. Let G be an
undirected graph. Suppose that
P = v1 , v2 , . . . , vk
is a simple path in G and that (vk , vi ) is an edge of G. Then
P = v1 , v2 , . . . , vi , vk , vk−1 , . . . , vi+2 , vi+1
is also a simple path, which we refer to as the rotation of P with the rotation edge
(vk , vi ); see Figure 5.2.
We first consider a simple, natural algorithm that proves challenging to analyze. We
assume that our input is presented as a list of adjacent edges for each vertex in the graph,
with the edges of each list being given in a random order according to independent and
uniform random permutations. Initially, the algorithm chooses an arbitrary vertex to
start the path; this is the initial head of the path. The head is always one of the endpoints
of the path. From this point on, the algorithm either “grows” the path deterministically
from the head, or rotates the path – as long as there is an adjacent edge remaining on
the head’s list. See Algorithm 5.1.
The difficulty in analyzing this algorithm is that, once the algorithm views some
edges in the edge lists, the distribution of the remaining edges is conditioned on the
edges the algorithm has already seen. We circumvent this difficulty by considering a
modified algorithm that, though less efficient, avoids this conditioning issue and so is
easier to analyze for the random graphs we consider. See Algorithm 5.2. Each ver-
tex v keeps two lists. The list used-edges(v) contains edges adjacent to v that have
been used in the course of the algorithm while v was the head; initially this list is
empty. The list unused-edges(v) contains other edges adjacent to v that have not been
used.
We initially analyze the algorithm assuming a specific model for the initial unused-
edges lists. We subsequently relate this model to the Gn,p model for random graphs.
Assume that each of the n − 1 possible edges connected to a vertex v is initially on the
unused-edges list for vertex v independently with some probability q. We also assume
these edges are in a random order. One way to think of this is that, before beginning
the algorithm, we create the unused-edges list for each vertex v by inserting each pos-
sible edge (v, u) with probability q; we think of the corresponding graph G as being
the graph including all edges that were inserted on some unused-edges list. Notice that
122
5.6 random graphs

Hamiltonian Cycle Algorithm:

Input: A graph G = (V, E ) with n vertices.

Algorithm 5.1: Hamiltonian cycle algorithm.

Modified Hamiltonian Cycle Algorithm:

Input: A graph G = (V, E ) with n vertices and associated edge lists.

Output: A Hamiltonian cycle, or failure.
1. Start with a random vertex as the head of the path.
2. Repeat the following steps until the rotation edge closes a Hamiltonian cycle or
the unused-edges list of the head of the path is empty:
(a) Let the current path be P = v1 , v2 , . . . , vk , with vk being the head.
(b) Execute i, ii, or iii with probabilities 1/n, |used-edges(vk )|/n, and
1 − 1/n − |used-edges(vk )|/n, respectively:
i. Reverse the path, and make v1 the head.
ii. Choose uniformly at random an edge from used-edges(vk ); if the edge
is (vk , vi ), rotate the current path with (vk , vi ) and set vi+1 to be the head.
(If the edge is (vk , vk−1 ), then no change is made.)
iii. Select the first edge from unused-edges(vk ), call it (vk , u). If u = vi for
1 ≤ i ≤ k, add u = vk+1 to the end of the path and make it the head.
Otherwise, if u = vi , rotate the current path with (vk , vi ) and set vi+1
to be the head. (This step closes the Hamiltonian path if k = n and the
chosen edge is (vn , v1 ).)
(c) Update the used-edges and unused-edges lists appropriately.
3. Return a Hamiltonian cycle if one was found or failure if no cycle was found.

Algorithm 5.2: Modified Hamiltonian cycle algorithm.

123
balls, bins, and random graphs

this means an edge (v, u) could initially be on the unused-edges list for v but not for
u. Also, when an edge (v, u) is first used in the algorithm, if v is the head then it is
removed just from the unused-edges list of v; if the edge is on the unused-edges list for
u, it remains on this list.
By choosing the rotation edge from either the used-edges list or the unused-edges
list with appropriate probabilities and then reversing the path with some small proba-
bility in each step, we modify the rotation process so that the next head of the list is
chosen uniformly at random from among all vertices of the graph. Once we establish
this property, the progress of the algorithm can be analyzed through a straightforward
application of our analysis of the coupon collector’s problem.
The modified algorithm appears wasteful; reversing the path or rotating with one of
the used edges cannot increase the path length. Also, we may not be taking advantage
of all the possible edges of G at each step. The advantage of the modified algorithm is
that it proves easier to analyze, owing to the following lemma.

Lemma 5.16: Suppose the modified Hamiltonian cycle algorithm is run on a graph
chosen using the described model. Let Vt be the head vertex after the tth step. Then, for
any vertex u, as long as at the tth step there is at least one unused edge available at the
head vertex,

Pr(Vt+1 = u | Vt = ut , Vt−1 = ut−1 , . . . , V0 = u0 ) = 1/n.

That is, the head vertex can be thought of as being chosen uniformly at random from
all vertices at each step, regardless of the history of the process.

Proof: Consider the possible cases when the path is P = v1 , v2 , . . . , vk .

The only way v1 can become the head is if the path is reversed, so Vt+1 = v1 with
probability 1/n.
If u = vi+1 is a vertex that lies on the path and (vk , vi ) is in used-edges(vk ), then the
probability that Vt+1 = u is
|used-edges(vk )| 1 1
= .
n |used-edges(vk )| n
If u is not covered by one of the first two cases then we use the fact that, when
an edge is chosen from unused-edges(vk ), the adjacent vertex is uniform over all
the n − |used-edges(vk )| − 1 remaining vertices. This follows from the principle of
deferred decisions. Our initial setup required the unused-edges list for vk to be con-
structed by including each possible edge with probability q and randomizing the order
of the list. This is equivalent to choosing X neighboring vertices for vk , where X is
a B(n − 1, q) random variable and the X vertices are chosen uniformly at random
without replacement. Because vk ’s list was determined independently from the lists
of the other vertices, the history of the algorithm tells us nothing about the remaining
edges in unused-edges(vk ), and the principle of deferred decisions applies. Hence any
edge in vk ’s unused-edges list that we have not seen is by construction equally likely
to connect to any of the n − |used-edges(vk )| − 1 remaining possible neighboring
vertices.
124
5.6 random graphs

If u = vi+1 is a vertex on the path but (vk , vi ) is not in used-edges(vk ), then the
probability that Vt+1 = u is the probability that the edge (vk , vi ) is chosen from unused-
edges(vk ) as the next rotation edge, which is

1 |used-edges(vk )| 1 1
1− − = . (5.9)
n n n − |used-edges(vk )| − 1 n
Finally, if u is not on the path, then the probability that Vt+1 = u is the probability that
the edge (vk+1 , u) is chosen from unused-edges(vk ). But this has the same probability
as in Eqn. (5.9).

For Algorithm 5.2, the problem of finding a Hamiltonian path looks exactly like the
coupon collector’s problem; the probability of finding a new vertex to add to the path
when there are k vertices left to be added is k/n. Once all the vertices are on the
path, the probability that a cycle is closed in each rotation is 1/n. Hence, if no list
of unused-edges is exhausted then we can expect a Hamiltonian path to be formed in
about O(n ln n) rotations, with about another O(n ln n) rotations to close the path to
form a Hamiltonian cycle. More concretely, we can prove the following theorem.
Theorem 5.17: Suppose the input to the modified Hamiltonian cycle algorithm ini-
tially has unused-edge lists where each edge (v, u) with u = v is placed on v’s list
independently with probability q ≥ 20 ln n/n. Then the algorithm successfully finds a
Hamiltonian cycle in O(n ln n) iterations of the repeat loop (step 2) with probability
1 − O(n−1 ).
Note that we did not assume that the input random graph has a Hamiltonian cycle. A
corollary of the theorem is that, with high probability, a random graph chosen in this
way has a Hamiltonian cycle.
Proof of Theorem 5.17: Consider the following two events.
E1 : The algorithm ran for 3n ln n steps with no unused-edges list becoming empty, but
it failed to construct a Hamiltonian cycle.
E2 : At least one unused-edges list became empty during the first 3n ln n iterations of
the loop.
For the algorithm to fail, either event E1 or E2 must occur. We first bound the proba-
bility of E1 . Lemma 5.16 implies that, as long as there is no empty unused-edges list in
the first 3n ln n iterations of step 2 of Algorithm 5.2, in each iteration the next head of
the path is uniform among the n vertices of the graph. To bound E1 , we therefore con-
sider the probability that more than 3n ln n iterations are required to find a Hamiltonian
cycle when the head is chosen uniformly at random each iteration.
The probability that the algorithm takes more than 2n ln n iterations to find a Hamil-
tonian path is exactly the probability that a coupon collector’s problem on n types
requires more than 2n ln n coupons. The probability that any specific coupon type has
not been found among 2n ln n random coupons is

1 2n ln n 1
1− ≤ e−2 ln n = 2 .
n n
125
balls, bins, and random graphs

By the union bound, the probability that any coupon type is not found is at most
1/n.
In order to complete a Hamiltonian path to a cycle the path must close, which it does
at each step with probability 1/n. Hence the probability that the path does not become
a cycle within the next n ln n iterations is
n ln n
1 1
1− ≤ e− ln n = .
n n

Thus we have shown that

2
Pr(E1 ) ≤ .
n
Next we bound Pr(E2 ), the probability that an unused-edges list is empty in the first
3n ln n iterations. We consider two subevents as follows.
E2a : At least 9 ln n edges were removed from the unused-edges list of at least one vertex
in the first 3n ln n iterations of the loop.
E2b : At least one vertex had fewer than 10 ln n edges initially in its unused-edges list.
For E2 to occur, either E2a or E2b must occur. Hence
Pr(E2 ) ≤ Pr(E2a ) + Pr(E2b ).
Let us first bound Pr(E2a ). Exactly one edge is used in each iteration of the loop.
From the proof of Lemma 5.16 we have that, at each iteration, the probability that a
given vertex v is the head of the path is 1/n, independently at each step. Hence the
number of times X that v is the head during the first 3n ln n steps is a binomial random
variable B(3n ln n, 1/n), and this dominates the number of edges taken from v’s unused-
edges list.
Using the Chernoff bound of Eqn. (4.1) with δ = 2 and μ = 3 ln n for the binomial
random variable B(3n ln n, 1/n), we have
2 3 ln n
e 1
Pr(X ≥ 9 ln n) ≤ ≤ 2.
27 n
By taking a union bound over all vertices, we find Pr(E2a ) ≤ 1/n.
Next we bound Pr(E2b ). The expected number of edges Y initially in a vertex’s
unused-edges list is at least (n − 1)q ≥ (20(n − 1) ln n)/n ≥ 19 ln n for sufficiently
large n. Using Chernoff bounds again (Eqn. (4.5)), the probability that any vertex ini-
tially has 10 ln n edges or fewer on its list is at most
1
Pr(Y ≤ 10 ln n) ≤ e−(19 ln n)(9/19) /2 ≤
2
,
n2
and by the union bound the probability that any vertex has too few adjacent edges is at
most 1/n. Thus,
1
Pr(E2b ) ≤
n
126
5.7 exercises

and hence
2
Pr(E2 ) ≤ .
n
In total, the probability that the algorithm fails to find a Hamiltonian cycle in 3n ln n
iterations is bounded by
4
Pr(E1 ) + Pr(E2 ) ≤ .
n

We did not make an effort to optimize the constants in the proof. There is, how-
ever, a clear trade-off; with more edges, one could achieve a lower probability of
failure.
We are left with showing how our algorithm can be applied to graphs in Gn,p . We
show that, as long as p is known, we can partition the edges of the graph into edge lists
that satisfy the requirements of Theorem 5.17.

Corollary 5.18: By initializing edges on the unused-edges lists appropriately, Algo-

rithm 5.2 will find a Hamiltonian cycle on a graph chosen randomly from Gn,p with
probability 1 − O(1/n) whenever p ≥ (40 ln n)/n.

Proof: We partition the edges of our input graph from Gn,p as follows. Let q ∈ [0, 1] be
such that p = 2q − q2 . Consider any edge (u, v) in the input graph. We execute exactly
one of the following three possibilities: with probability q(1 − q)/(2q − q2 ) we place
the edge on u’s unused-edges list but not on v’s; with probability q(1 − q)/(2q − q2 ) we
initially place the edge on v’s unused-edges list but not on u’s; and with the remaining
probability q2 /(2q − q2 ) the edge is placed on both unused-edges lists.
Now, for any possible edge (u, v), the probability that it is initially placed in the
unused-edges list for v is

q(1 − q) q2
p + = q.
2q − q2 2q − q2
Moreover, the probability that an edge (u, v) is initially placed on the unused-edges
list for both u and v is pq2 /(2q − q2 ) = q2 , so these two placements are indepen-
dent events. Since each edge (u, v) is treated independently, this partitioning fulfills
the requirements of Theorem 5.17 provided the resulting q is at least 20 ln n/n. When
p ≥ (40 ln n)/n we have q ≥ p/2 ≥ (20 ln n)/n, and the result follows.

In Exercise 5.27, we consider how to use Algorithm 5.2 even in the case where p is not
known in advance, so that the edge lists must be initialized without knowledge of p.

5.7. Exercises

Exercise 5.1: For what values of n is (1 + 1/n)n within 1% of e? Within 0.0001%

of e? Similarly, for what values of n is (1 − 1/n)n within 1% of 1/e? Within 0.0001%?

127
balls, bins, and random graphs

Exercise 5.2: Suppose that Social Security numbers were issued uniformly at random,
with replacement. That is, your Social Security number would consist of just nine ran-
domly generated digits, and no check would be made to ensure that the same number
was not issued twice. Sometimes, the last four digits of a Social Security number are
used as a password. How many people would you need to have in a room before it was
more likely than not that two had the same last four digits? How many numbers could
be issued before it would be more likely than not that there is a duplicate number? How
would you answer these two questions if Social Security numbers had 13 digits? Try
to give exact numerical answers.

Exercise 5.3: Suppose that balls are thrown randomly into n bins. Show, for some
√
constant c1 , that if there are c1 n balls then the probability that no two land in the
same bin is at most 1/e. Similarly, show for some constant c2 (and sufficiently large n)
√
that, if there are c2 n balls, then the probability that no two land in the same bin is at
least 1/2. Make these constants as close to optimal as possible. Hint: You may want to
use the facts that
e−x ≥ 1 − x
and
1
e−x−x ≤ 1 − x
2
for x≤ .
2
Exercise 5.4: In a lecture hall containing 100 people, you consider whether or not
there are three people in the room who share the same birthday. Explain how to calculate
this probability exactly, using the same assumptions as in our previous analysis.

Exercise 5.5: Use the moment generating function of the Poisson distribution to com-
pute the second moment and the variance of the distribution.

Exercise 5.6: Let X be a Poisson random variable with mean μ, representing the num-
ber of errors on a page of this book. Each error is independently a grammatical error
with probability p and a spelling error with probability 1 − p. If Y and Z are random
variables representing the number of grammatical and spelling errors (respectively) on
a page of this book, prove that Y and Z are Poisson random variables with means μp
and μ(1 − p), respectively. Also, prove that Y and Z are independent.

Exercise 5.7: Use the Taylor expansion

x2 x3 x4
ln(1 + x) = x − + − + ···
2 3 4
to prove that, for any x with |x| ≤ 1,
ex (1 − x2 ) ≤ 1 + x ≤ ex .

Exercise 5.8: Suppose that n balls are thrown independently and uniformly at random
into n bins.
128
5.7 exercises

(a) Find the conditional probability that bin 1 has one ball given that exactly one ball
fell into the first three bins.
(b) Find the conditional expectation of the number of balls in bin 1 under the condition
that bin 2 received no balls.
(c) Write an expression for the probability that bin 1 receives more balls than bin 2.

Exercise 5.9: Our analysis of Bucket sort in Section 5.2.2 assumed that n elements
were chosen independently and uniformly at random from the range [0, 2k ). Suppose
instead that n elements are chosen independently from the range [0, 2k ) according to
a distribution with the property that any number x ∈ [0, 2k ) is chosen with probability
at most a/2k for some fixed constant a > 0. Show that, under these conditions, Bucket
sort still requires linear expected time.

Exercise 5.10: Consider the probability that every bin receives exactly one ball when
n balls are thrown randomly into n bins.

(a) Give an upper bound on this probability using the Poisson approximation.
(b) Determine the exact probability of this event.
(c) Show that these two probabilities differ by a multiplicative factor that equals the
probability that a Poisson random variable with parameter n takes on the value n.
Explain why this is implied by Theorem 5.6.

Exercise 5.11: Consider throwing m balls into n bins, and for convenience let the
bins be numbered from 0 to n − 1. We say there is a k-gap starting at bin i if bins
i, i + 1, . . . , i + k − 1 are all empty.

(a) Determine the expected number of k-gaps.

(b) Prove a Chernoff-like bound for the number of k-gaps. (Hint: If you let Xi = 1
when there is a k-gap starting at bin i, then there are dependencies between Xi and
Xi+1 ; to avoid these dependencies, you might consider Xi and Xi+k .)

Exercise 5.12: The following problem models a simple distributed system wherein
agents contend for resources but “back off” in the face of contention. Balls represent
agents, and bins represent resources.
The system evolves over rounds. Every round, balls are thrown independently and
uniformly at random into n bins. Any ball that lands in a bin by itself is served and
removed from consideration. The remaining balls are thrown again in the next round.
We begin with n balls in the first round, and we finish when every ball is served.

(a) If there are b balls at the start of a round, what is the expected number of balls at
the start of the next round?
(b) Suppose that every round the number of balls served was exactly the expected
number of balls to be served. Show that all the balls would be served in O(log log n)
rounds. (Hint: If x j is the expected number of balls left after j rounds, show and
use that x j+1 ≤ x2j /n.)

129
balls, bins, and random graphs

Exercise 5.13: Suppose that we vary the balls-and-bins process as follows. For conve-
nience let the bins be numbered from 0 to n − 1. There are log2 n players. Each player
randomly chooses a starting location uniformly from [0, n − 1] and then places one
ball in each of the bins numbered mod n, + 1 mod n, . . . , + n/ log2 n − 1 mod n.
Argue that the maximum load in this case is only O(log log n/ log log log n) with prob-
ability that approaches 1 as n → ∞.

Exercise 5.14: We prove that if Z is a Poisson random variable of mean μ, where

μ ≥ 1 is an integer, then Pr(Z ≥ μ) ≥ 1/2.
(a) Show that Pr(Z = μ + h) ≥ Pr(Z = μ − h − 1) for 0 ≤ h ≤ μ − 1.
(b) Using part (a), argue that Pr(Z ≥ μ) ≥ 1/2.
(c) Show that Pr(Z ≤ μ) ≤ 1/2 for all integers μ from 1 to 10 by explicitly performing
the calculation. (This is in fact true for all integers μ ≥ 1, but it is more difficult to
prove.)

Exercise 5.15: (a) In Theorem 5.7 we showed that, for any nonnegative functions f,
(m)
E f Y1(m) , . . . , Yn(m) ≥ E f X1(m) , . . . , Xn(m) Pr Yi = m .

Prove that if E[ f (X1(m) , . . . , Xn(m) )] is monotonically increasing in m, then

(m)
E f Y1(m) , . . . , Yn(m) ≥ E f X1(m) , . . . , Xn(m) Pr Yi ≥ m ,
again under the condition that f is nonnegative. Make a similar statement for the case
when E[ f (X1(m) , . . . , Xn(m) )] is monotonically decreasing in m.
(b) Using part (a) and Exercise 5.14, prove Theorem 5.10 for the case that
E[ f (X1(m) , . . . , Xn(m) )] is monotonically increasing.

Exercise 5.16: We consider another way to obtain Chernoff-like bounds in the setting
of balls and bins without using Theorem 5.7. Consider n balls thrownrandomly into
n bins. Let Xi = 1 if the ith bin is empty and 0 otherwise. Let X = ni=1 Xi . Let Yi ,
i = 1, . . . , n, be independent
Bernoulli random variables that are 1 with probability
p = (1 − 1/n)n . Let Y = ni=1 Yi .
(a) Show that E[X1 X2 · · · Xk ] ≤ E[Y1Y2 · · · Yk ] for any k ≥ 1.
(b) Show that E[etX ] ≤ E[etY ] for all t ≥ 0. (Hint: Use the expansion for ex and com-
pare E[X k ] to E[Y k ].)
(c) Derive a Chernoff bound for Pr(X ≥ (1 + δ)E[X]).

Exercise 5.17: Let G be a random graph generated using the Gn,p model.

(a) A clique of k vertices in a graph is a subset of k vertices such that all 2k edges
between these vertices lie in the graph. For what value of p, as a function of n, is
the expected number of cliques of five vertices in G equal to 1?
(b) A K3,3 graph is a complete bipartite graph with three vertices on each side. In other
words, it is a graph with six vertices and nine edges; the six distinct vertices are
arranged in two groups of three, and the nine edges connect each of the nine pairs
130
5.7 exercises

of vertices with one vertex in each group. For what value of p, as a function of n,
is the expected number of K3,3 subgraphs of G equal to 1?
(c) For what value of p, as a function of n, is the expected number of Hamiltonian
cycles in the graph equal to 1?

Exercise 5.18: Theorem 5.7 shows that any event that occurs with small probability
in the balls-and-bins setting where the number of balls in each bin is an independent
Poisson random variable also occurs with small probability in the standard balls-and-
bins model. Prove a similar statement for random graphs: Every event that happens
in the Gn,p model also happens with small probability in the
with small probability
Gn,N model for N = n2 p.

Exercise 5.19: An undirected graph on n vertices is disconnected if there exists a set

of k < n vertices such that there is no edge between this set and the rest of the graph.
Otherwise, the graph is said to be connected. Show that there exists a constant c such
that if N ≥ cn log n then, with probability 1 − o(1), a graph randomly chosen from Gn,N
is connected.

Exercise 5.20: Prove Theorem 5.15.

Exercise 5.21: (a) Let f (n) be the expected number of random edges that must be
added before an empty undirected graph with n vertices becomes connected. (Con-
nectedness is defined in Exercise 5.19.) That is, suppose that we start with a graph on
n vertices with zero edges and then repeatedly add an edge, chosen uniformly at ran-
dom from all edges not currently in the graph, until the graph becomes connected. If
Xn represents the number of edges added, then f (n) = E[Xn ].
Write a program to estimate f (n) for a given value of n. Your program should track
the connected components of the graph as you add edges until the graph becomes con-
nected. You will probably want to use a disjoint set data structure, a topic covered in
standard undergraduate algorithms texts. You should try n = 100, 200, 300, 400, 500,
600, 700, 800, 900, and 1000. Repeat each experiment 100 times, and for each value of
n compute the average number of edges needed. Based on your experiments, suggest a
function h(n) that you think is a good estimate for f (n).
(b) Modify your program for the problem in part (a) so that it also keeps track of
isolated vertices. Let g(n) be the expected number of edges added before there are no
more isolated vertices. What seems to be the relationship between f (n) and g(n)?

Exercise 5.22: In hashing with open addressing, the hash table is implemented as an
array and there are no linked lists or chaining. Each entry in the array either contains
one hashed item or is empty. The hash function defines, for each key k, a probe sequence
h(k, 0), h(k, 1), . . . of table locations. To insert the key k, we first examine the sequence
of table locations in the order defined by the key’s probe sequence until we find an
empty location; then we insert the item at that position. When searching for an item in
the hash table, we examine the sequence of table locations in the order defined by the
key’s probe sequence until either the item is found or we have found an empty location
131
balls, bins, and random graphs

in the sequence. If an empty location is found, this means the item is not present in the
table.
An open-address hash table with 2n entries is used to store n items. Assume that the
table location h(k, j) is uniform over the 2n possible table locations and that all h(k, j)
are independent.
(a) Show that, under these conditions, the probability of an insertion requiring more
than k probes is at most 2−k .
(b) Show that, for i = 1, 2, . . . , n, the probability that the ith insertion requires more
than 2 log n probes is at most 1/n2 .
Let the random variable Xi denote the number of probes required by the ith insertion.
You have shown in part (b) that Pr(Xi > 2 log n) ≤ 1/n2 . Let the random variable X =
max1≤i≤n Xi denote the maximum number of probes required by any of the n insertions.
(c) Show that Pr(X > 2 log n) ≤ 1/n.
(d) Show that the expected length of the longest probe sequence is E[X] = O(log n).

Exercise 5.23: Bloom filters can be used to estimate set differences. Suppose you have
a set X and I have a set Y, both with n elements. For example, the sets might represent
our 100 favorite songs. We both create Bloom filters of our sets, using the same number
of bits m and the same k hash functions. Determine the expected number of bits where
our Bloom filters differ as a function of m, n, k, and |X ∩ Y |. Explain how this could be
used as a tool to find people with the same taste in music more easily than comparing
lists of songs directly.

Exercise 5.24: Suppose that we wanted to extend Bloom filters to allow deletions as
well as insertions of items into the underlying set. We could modify the Bloom filter
to be an array of counters instead of an array of bits. Each time an item is inserted
into a Bloom filter, the counters given by the hashes of the item are increased by one.
To delete an item, one can simply decrement the counters. To keep space small, the
counters should be a fixed length, such as 4 bits.
Explain how errors can arise when using fixed-length counters. Assuming a setting
where one has at most n elements in the set at any time, m counters, k hash functions,
and counters with b bits, explain how to bound the probability that an error occurs over
the course of t insertions or deletions.

Exercise 5.25: Suppose that you built a Bloom filter for a dictionary of words with
m = 2b bits. A co-worker building an application wants to use your Bloom filter but
has only 2b−1 bits available. Explain how your colleague can use your Bloom filter to
avoid rebuilding a new Bloom filter using the original dictionary of words.

Exercise 5.26: For the leader election problem alluded to in Section 5.5.4, we have n
users, each with an identifier. The hash function takes as input the identifier and outputs
a b-bit hash value, and we assume that these values are independent and uniformly
distributed. Each user hashes its identifier, and the leader is the user with the smallest
132
5.8 an exploratory assignment

hash value. Give lower and upper bounds on the number of bits b necessary to ensure
that a unique leader is successfully chosen with probability p. Make your bounds as
tight as possible.

Exercise 5.27: Consider Algorithm 5.2, the modified algorithm for finding Hamilto-
nian cycles. We have shown that the algorithm can be applied to find a Hamiltonian
cycle with high probability in a graph chosen randomly from Gn,p , when p is known
and sufficiently large, by initially placing edges in the edge lists appropriately. Argue
that the algorithm can similarly be applied to find a Hamiltonian cycle with high prob-
ability on a graph chosen randomly from Gn,N when N = c1 n ln n for a suitably large
constant c1 . Argue also that the modified algorithm can be applied even when p is not
known in advance as long as p is at least c2 ln n/n for a suitably large constant c2 .

5.8. An Exploratory Assignment

Part of the research process in random processes is first to understand what is going on
at a high level and then to use this understanding in order to develop formal mathemat-
ical proofs. In this assignment, you will be given several variations on a basic random
process. To gain insight, you should perform experiments based on writing code to
simulate the processes. (The code should be very short, a few pages at most.) After
the experiments, you should use the results of the simulations to guide you to make
conjectures and prove statements about the processes. You can apply what you have
learned up to this point, including probabilistic bounds and analysis of balls-and-bins
problems.
Consider a complete binary tree with N = 2n − 1 nodes. Here n is the depth of the
tree. Initially, all nodes are unmarked. Over time, via processes that we shall describe,
nodes becomes marked.
All of the processes share the same basic form. We can think of the nodes as having
unique identifying numbers in the range of [1, N]. Each unit of time, I send you the
identifier of a node. When you receive a sent node, you mark it. Also, you invoke the
following marking rule, which takes effect before I send out the next node.
r If a node and its sibling are marked, its parent is marked.
r If a node and its parent are marked, the other sibling is marked.

The marking rule is applied recursively as much as possible before the next node is
sent. For example, in Figure 5.3, the marked nodes are filled in. The arrival of the node
labeled by an X will allow you to mark the remainder of the nodes, as you apply the
marking rule first up and then down the tree. Keep in mind that you always apply the
marking rule as much as possible.
Now let us consider the different ways in which I might be sending you the nodes.
Process 1: Each unit of time, I send the identifier of a node chosen independently and
uniformly at random from all of the N nodes. Note that I might send you a node that
is already marked, and in fact I may send a useless node that I have already sent.
133
balls, bins, and random graphs

Figure 5.3: The arrival of X causes all other nodes to be marked.

Process 2: Each unit of time I send the identifier of a node chosen uniformly at random
from those nodes that I have not yet sent. Again, a node that has already been marked
might arrive, but each node will be sent at most once.
Process 3: Each unit of time I send the identifier of a node chosen uniformly at random
from those nodes that you have not yet marked.
We want to determine how many time steps are needed before all the nodes are
marked for each of these processes. Begin by writing programs to simulate the sending
processes and the marking rule. Run each process ten times for each value of n in the
range [10, 20]. Present the data from your experiments in a clear, easy-to-read fashion
and explain your data suitably. A tip: You may find it useful to have your program print
out the last node that was sent before the tree became completely marked.
1. For the first process, prove that the expected number of nodes sent is (N log N).
How well does this match your simulations?
2. For the second process, you should find that almost all N nodes must be sent
√ before
the tree is marked. Show that, with constant probability, at least N − 2 N nodes
must be sent.
3. The behavior of the third process might seem a bit unusual. Explain it with a proof.
After answering these questions, you may wish to consider other facts you could prove
about these processes.

134
chapter six
The Probabilistic Method

The probabilistic method is a way of proving the existence of objects. The under-
lying principle is simple: to prove the existence of an object with certain properties,
we demonstrate a sample space of objects in which the probability is positive that a
randomly selected object has the required properties. If the probability of selecting an
object with the required properties is positive, then the sample space must contain such
an object, and therefore such an object exists. For example, if there is a positive proba-
bility of winning a million-dollar prize in a raffle, then there must be at least one raffle
ticket that wins that prize.
Although the basic principle of the probabilistic method is simple, its application to
specific problems often involves sophisticated combinatorial arguments. In this chapter
we study a number of techniques for constructing proofs based on the probabilistic
method, starting with simple counting and averaging arguments and then introducing
two more advanced tools, the Lovász local lemma and the second moment method.
In the context of algorithms we are generally interested in explicit constructions
of objects, not merely in proofs of existence. In many cases the proofs of existence
obtained by the probabilistic method can be converted into efficient randomized con-
struction algorithms. In some cases, these proofs can be converted into efficient deter-
ministic construction algorithms; this process is called derandomization, since it con-
verts a probabilistic argument into a deterministic one. We give examples of both
randomized and deterministic construction algorithms arising from the probabilistic
method.

6.1. The Basic Counting Argument

To prove the existence of an object with specific properties, we construct an appropriate

probability space S of objects and then show that the probability that an object in S
with the required properties is selected is strictly greater than 0.
For our first example, we consider the problem of coloring the edges of a graph
with two colors so that there are no large cliques with all edges having the same color.
135
the probabilistic method

Let Kn be a complete graph (with all n2 edges) on n vertices. A clique of k vertices in
Kn is a complete subgraph Kk .
k
Theorem 6.1: If nk 2− 2 +1 < 1 then it is possible to color the edges of Kn with two
colors so that it has no monochromatic Kk subgraph.

Proof: Define a sample space nconsisting of all possible colorings of the edges of Kn
using two colors. There are 2 2 possible colorings, so if one is chosen uniformly n at

random then the probability of choosing each coloring in our probability space is 2− 2 .
A nice way to think about this probability space is: if we color each edge of the graph
independently, with each edge taking each of the two colors with probability 1/2, then
we obtain a random coloring chosen uniformly from this sample space. That is, we flip
an independent fair coin to determine the color of each edge.
n
Fix an arbitrary
n ordering of all of the k
different k-vertex cliques of Kn , and for
i = 1, . . . , k let Ai be the event that
clique i is monochromatic. Once the first edge
in clique i is colored, the remaining 2k − 1 edges must all be given the same color. It
follows that

− k
+1
Pr(Ai ) = 2 2
.

Using a union bound then yields

⎛ n ⎞ n

⎜ ⎟
k k
n − 2k +1
Pr ⎝ Ai ⎠ ≤ Pr(Ai ) = 2 < 1,
i=1 i=1
k

where the last inequality follows from the assumptions of the theorem. Hence
⎛ n ⎞ ⎛ n ⎞
⎜
k
⎟ ⎜ ⎟
k

Pr ⎝ Ai ⎠ = 1 − Pr ⎝ Ai ⎠ > 0.
i=1 i=1

Since the probability of choosing a coloring with no monochromatic k-vertex clique

from our sample space is strictly greater than 0, there must exist a coloring with no
monochromatic k-vertex clique.

As an example, consider whether the edges of K1000 can be 2-colored in such a way
that there is no monochromatic K20 . Our calculations are simplified if we note that, for
n ≤ 2k/2 and k ≥ 3,

n − 2k +1 nk
2 ≤ 2−(k(k−1)/2)+1
k k!
2k/2+1
≤
k!
< 1.

Observing that for our example n = 1000 ≤ 210 = 2k/2 , we see that by Theorem 6.1
there exists a 2-coloring of the edges of K1000 with no monochromatic K20 .
136
6.2 the expectation argument

Can we use this proof to design an efficient algorithm to construct such a coloring?
Let us consider a general approach that gives a randomized construction algorithm.
First, we require that we can efficiently sample a coloring from the sample space. In
this case sampling is easy, because we can simply color each edge independently with
a randomly chosen color. In general, however, there might not be an efficient sampling
algorithm.
If we have an efficient sampling algorithm, the next question is: How many sam-
ples must we generate before obtaining a sample that satisfies our requirements? If
the probability of obtaining a sample with the desired properties is p and if we sam-
ple independently at each trial, then the number of samples needed before finding a
sample with the required properties is a geometric random variable with expectation
1/p. Hence we need that 1/p be polynomial in the problem size in order to have an
algorithm that finds a suitable sample in polynomial expected time.
If p = 1 − o(1), then sampling once gives a Monte Carlo construction algorithm
that is incorrect with probability o(1). In our specific example of finding a coloring on
a graph of 1000 vertices with no monochromatic K20 , we know that the probability that
a random coloring has a monochromatic K20 is at most
220/2+1
< 8.5 · 10−16 .
20!
Hence we have a Monte Carlo algorithm with a small probability of failure.
If we want a Las Vegas algorithm – that is, one that always gives a correct construc-
tion – then we need a third ingredient. We require a polynomial time procedure for
verifying that a sample object satisfies the requirements; then we can test samples until
we find one that does so. An upper bound on the expected time for this construction
can be found by multiplying together the expected number of samples 1/p by the sum
of an upper bound on the time to generate each sample and an upper bound on the time
to check each sample.1 For the coloring problem, there is a polynomial time verifica-
tion algorithm when k is a constant: simply check all nk cliques and make sure they
are not monochromatic. It does not seem that this approach can be extended to yield
polynomial time algorithms when k grows with n.

6.2. The Expectation Argument

As we have seen, in order to prove that an object with certain properties exists, we
can design a probability space from which an element chosen at random yields an
object with the desired properties with positive probability. A similar and sometimes
easier approach for proving that such an object exists is to use an averaging argument.
The intuition behind this approach is that, in a discrete probability space, a random
variable must with positive probability assume at least one value that is no greater
than its expectation and at least one value that is not smaller than its expectation.

1 Sometimes the time to generate or check a sample may itself be a random variable. In this case, Wald’s equation
(discussed in Chapter 13) may apply.

137
the probabilistic method

For example, if the expected value of a raffle ticket is $3, then there must be at least
one ticket that ends up being worth no more than $3 and at least one that ends up being
worth no less than $3.
More formally, we have the following lemma.
Lemma 6.2: Suppose we have a probability space S and a random variable X defined
on S such that E[X] = μ. Then Pr(X ≥ μ) > 0 and Pr(X ≤ μ) > 0.
Proof: We have

μ = E[X] = x Pr(X = x),
x

where the summation ranges over all values in the range of X. If Pr(X ≥ μ) = 0, then

μ= x Pr(X = x) = x Pr(X = x) < μ Pr(X = x) = μ,
x x<μ x<μ

giving a contradiction. Similarly, if Pr(X ≤ μ) = 0 then

μ= x Pr(X = x) = x Pr(X = x) > μ Pr(X = x) = μ,
x x>μ x>μ

again yielding a contradiction.

Thus, there must be at least one instance in the sample space of S for which the value
of X is at least μ and at least one instance for which the value of X is no greater than μ.

6.2.1. Application: Finding a Large Cut

We consider the problem of finding a large cut in an undirected graph. A cut is a parti-
tion of the vertices into two disjoint sets, and the value of a cut is the weight of all edges
crossing from one side of the partition to the other. Here we consider the case where
all edges in the graph have the same weight 1. The problem of finding a maximum cut
is NP-hard. Using the probabilistic method, we show that the value of the maximum
cut must be at least 1/2 the number of edges in the graph.
Theorem 6.3: Given an undirected graph G with m edges, there is a partition of V
into two disjoint sets A and B such that at least m/2 edges connect a vertex in A to a
vertex in B. That is, there is a cut with value at least m/2.
Proof: Construct sets A and B by randomly and independently assigning each vertex
to one of the two sets. Let e1 , . . . , em be an arbitrary enumeration of the edges of G.
For i = 1, . . . , m, define Xi such that

1 if edge i connects A to B,
Xi =
0 otherwise.
The probability that edge ei connects a vertex in A to a vertex in B is 1/2, and thus
1
E[Xi ] = .
2
138
6.2 the expectation argument

Let C(A, B) be a random variable denoting the value of the cut corresponding to the
sets A and B. Then
m
m
1 m
E[C(A, B)] = E Xi = E[Xi ] = m · = .
i=1 i=1
2 2

Since the expectation of the random variable C(A, B) is m/2, there exists a partition A
and B with at least m/2 edges connecting the set A to the set B.

We can transform this argument into an efficient algorithm for finding a cut with value
at least m/2. We first show how to obtain a Las Vegas algorithm. In Section 6.3, we
show how to construct a deterministic polynomial time algorithm.
It is easy to randomly choose a partition as described in the proof. The expectation
argument does not give a lower bound on the probability that a random partition has a
cut of value at least m/2. To derive such a bound, let

m
p = Pr C(A, B) ≥ ,
2
and observe that C(A, B) ≤ m. Then
m
= E[C(A, B)]
2
= i Pr(C(A, B) = i) + i Pr(C(A, B) = i)
i<m/2 i≥m/2

m
≤ (1 − p) − 1 + pm,
2
which implies that
1
p≥ .
m/2 + 1
The expected number of samples before finding a cut with value at least m/2 is therefore
just m/2 + 1. Testing to see if the value of the cut determined by the sample is at least
m/2 can be done in polynomial time simply by counting the edges crossing the cut. We
therefore have a Las Vegas algorithm for finding the cut.

6.2.2. Application: Maximum Satisfiability

We can apply a similar argument to the maximum satisfiability (MAXSAT) problem.
In a logical formula, a literal is either a Boolean variable or the negation of a Boolean
variable. We use x to denote the negation of the variable x. A satisfiability (SAT) prob-
lem, or a SAT formula, is a logical expression that is the conjunction (AND) of a set
of clauses, where each clause is the disjunction (OR) of literals. For example, the fol-
lowing expression is an instance of SAT:

(x1 ∨ x2 ∨ x3 ) ∧ (x1 ∨ x3 ) ∧ (x1 ∨ x2 ∨ x4 ) ∧ (x4 ∨ x3 ) ∧ (x4 ∨ x1 ).

139
the probabilistic method

A solution to an instance of a SAT formula is an assignment of the variables to the

values True and False so that all the clauses are satisfied. That is, there is at least one
true literal in each clause. For example, assigning x1 to True, x2 to False, x3 to False,
and x4 to True satisfies the preceding SAT formula. In general, determining if a SAT
formula has a solution is NP-hard.
A related goal, given a SAT formula, is satisfying as many of the clauses as pos-
sible. In what follows, let us assume that no clause contains both a variable and its
complement, since in this case the clause is always satisfied.
Theorem 6.4: Given a set of m clauses, let ki be the number of literals in the ith clause
for i = 1, . . . , m. Let k = minm
i=1 ki . Then there is a truth assignment that satisfies at
least

m
(1 − 2−ki ) ≥ m(1 − 2−k )
i=1

clauses.
Proof: Assign values independently and uniformly at random to the variables. The
probability that the ith clause with ki literals is satisfied is at least (1 − 2−ki ). The
expected number of satisfied clauses is therefore at least

m
(1 − 2−ki ) ≥ m(1 − 2−k ),
i=1

and there must be an assignment that satisfies at least that many clauses.

The foregoing argument can also be easily transformed into an efficient randomized
algorithm; the case where all ki = k is left as Exercise 6.1.

6.3. Derandomization Using Conditional Expectations

The probabilistic method can yield insight into how to construct deterministic algo-
rithms. As an example, we apply the method of conditional expectations in order to
derandomize the algorithm of Section 6.2.1 for finding a large cut.
Recall that we find a partition of the n vertices V of a graph into sets A and B by
placing each vertex independently and uniformly at random in one of the two sets. This
gives a cut with expected value E[C(A, B)] ≥ m/2. Now imagine placing the vertices
deterministically, one at a time, in an arbitrary order v1 , v2 , . . . , vn . Let xi be the set
where vi is placed (so xi is either A or B). Suppose that we have placed the first k
vertices, and consider the expected value of the cut if the remaining vertices are then
placed independently and uniformly into one of the two sets. We write this quantity
as E[C(A, B) | x1 , x2 , . . . , xk ]; it is the conditional expectation of the value of the cut
given the locations x1 , x2 , . . . , xk of the first k vertices. We show inductively how to
place the next vertex so that
E[C(A, B) | x1 , x2 , . . . , xk ] ≤ E[C(A, B) | x1 , x2 , . . . , xk+1 ].
140
6.3 derandomization using conditional expectations

It follows that
E[C(A, B)] ≤ E[C(A, B) | x1 , x2 , . . . , xn ].
The right-hand side is the value of the cut determined by our placement algorithm,
since if x1 , x2 , . . . , xn are all determined then we have a cut of the graph. Hence our
algorithm returns a cut whose value is at least E[C(A, B)] ≥ m/2.
The base case in the induction is
E[C(A, B) | x1 ] = E[C(A, B)],
which holds by symmetry because it does not matter where we place the first vertex.
We now prove the inductive step, that
E[C(A, B) | x1 , x2 , . . . , xk ] ≤ E[C(A, B) | x1 , x2 , . . . , xk+1 ]. (6.1)
Consider placing vk+1 randomly, so that it is placed in A or B with probability 1/2 each,
and let Yk+1 be a random variable representing the set where it is placed. Then
1
E[C(A, B) | x1 , x2 , . . . , xk ] = E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = A]
2
1
+ E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = B].
2
It follows that

max E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = A], E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = B]
≥ E[C(A, B) | x1 , x2 , . . . , xk ].
Therefore, all we have to do is compute the two quantities E[C(A, B) | x1 , x2 , . . . ,
xk , Yk+1 = A] and E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = B] and then place the vk+1 in the
set that yields the larger expectation. Once we do this, we will have a placement satis-
fying
E[C(A, B) | x1 , x2 , . . . , xk ] ≤ E[C(A, B) | x1 , x2 , . . . , xk+1 ].
To compute E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = A], note that the conditioning gives
the placement of the first k + 1 vertices. We can therefore compute the number of edges
among these vertices that contribute to the value of the cut. For all other edges, the
probability that it will later contribute to the cut is 1/2, since this is the probability
its two endpoints end up on different sides of the cut. By linearity of expectations,
E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 = A] is the number of edges crossing the cut whose
endpoints are both among the first k + 1 vertices, plus half of the remaining edges. This
is easy to compute in linear time. The same is true for E[C(A, B) | x1 , x2 , . . . , xk , Yk+1 =
B].
In fact, from this argument, we see that the larger of the two quantities is determined
just by whether vk+1 has more neighbors in A or in B. All edges that do not have vk+1
as an endpoint contribute the same amount to the two expectations. Our derandomized
algorithm therefore has the following simple form: Take the vertices in some order.
Place the first vertex arbitrarily in A. Place each successive vertex to maximize the
number of edges crossing the cut. Equivalently, place each vertex on the side with
141
the probabilistic method

fewer neighbors, breaking ties arbitrarily. This is a simple greedy algorithm, and our
analysis shows that it always guarantees a cut with at least m/2 edges.

6.4. Sample and Modify

Thus far we have used the probabilistic method to construct random structures with the
desired properties directly. In some cases it is easier to work indirectly, breaking the
argument into two stages. In the first stage we construct a random structure that does not
have the required properties. In the second stage we then modify the random structure
so that it does have the required property. We give two examples of this sample-and-
modify technique.

6.4.1. Application: Independent Sets

An independent set in a graph G is a set of vertices with no edges between them. Find-
ing the largest independent set in a graph is an NP-hard problem. The following the-
orem shows that the probabilistic method can yield bounds on the size of the largest
independent set of a graph.

Theorem 6.5: Let G = (V, E ) be a connected graph on n vertices with m ≥ n/2 edges.
Then G has an independent set with at least n2 /4m vertices.

Proof: Let d = 2m/n ≥ 1 be the average degree of the vertices in G. Consider the
following randomized algorithm.

1. Delete each vertex of G (together with its incident edges) independently with prob-
ability 1 − 1/d.
2. For each remaining edge, remove it and one of its adjacent vertices.

The remaining vertices form an independent set, since all edges have been removed.
This is an example of the sample-and-modify technique. We first sample the vertices,
and then we modify the remaining graph.
Let X be the number of vertices that survive the first step of the algorithm. Since
the graph has n vertices and since each vertex survives with probability 1/d, it follows
that
n
E[X] = .
d
Let Y be the number of edges that survive the first step. There are nd/2 edges in
the graph, and an edge survives if and only if its two adjacent vertices survive.
Thus

nd 1 2 n
E[Y ] = = .
2 d 2d
The second step of the algorithm removes all the remaining edges and at most Y
vertices. When the algorithm terminates, it outputs an independent set of size at least
142
6.5 the second moment method

X − Y , and
n n n
E[X − Y ] =− = .
d 2d 2d
The expected size of the independent set generated by the algorithm is at least n/2d,
so the graph has an independent set with at least n/2d = n2 /4m vertices.

6.4.2. Application: Graphs with Large Girth

As another example we consider the girth of a graph, which is the length of its smallest
cycle. Intuitively we expect dense graphs to have small girth. We can show, however,
that there are dense graphs with relatively large girth.
Theorem 6.6: For any integer k ≥ 3, for n sufficiently large there is a graph with n
nodes, at least 14 n1+1/k edges, and girth at least k.
Proof: We first sample a random graph G ∈ Gn,p with p = n1/k−1 . Let X be the number
of edges in the graph. Then

n 1 1 1/k+1
E[X] = p = 1− n .
2 2 n
Let Y be the number of cycles in the graph of length at most k − 1. Any specific
possible cycle of length i, where 3 ≤ i ≤ k − 1, occurs with probability pi . Also, there
n (i−1)!
are i 2
possible cycles of length i; to see this, first consider choosing the i vertices,
then consider the possible orders, and finally keep in mind that reversing the order
yields the same cycle. Hence,
k−1
n (i − 1)! i i i i/k
k−1 k−1
E[Y ] = p ≤ np = n < kn(k−1)/k .
i=3
i 2 i=3 i=3

We modify the original randomly chosen graph G by eliminating one edge from each
cycle of length up to k − 1. The modified graph therefore has girth at least k. When n
is sufficiently large, the expected number of edges in the resulting graph is

1 1 1/k+1 1
E[X − Y ] ≥ 1− n − kn(k−1)/k ≥ n1/k+1 .
2 n 4
Hence there exists a graph with at least 14 n1+1/k edges and girth at least k.

6.5. The Second Moment Method

The second moment method is another useful way to apply the probabilistic method.
The standard approach typically makes use of the following inequality, which is easily
derived from Chebyshev’s inequality.
Theorem 6.7: If X is an integer-valued random variable, then
Var[X]
Pr(X = 0) ≤ . (6.2)
(E[X])2
143
the probabilistic method

Proof:
Var[X]
Pr(X = 0) ≤ Pr(|X − E[X]| ≥ E[X]) ≤ .
(E[X])2

6.5.1. Application: Threshold Behavior in Random Graphs

The second moment method can be used to prove the threshold behavior of certain
random graph properties. That is, in the Gn,p model it is often the case that there is a
threshold function f such that: (a) when p is just less than f (n), almost no graph has the
desired property; whereas (b) when p is just larger than f (n), almost every graph has
the desired property. We present here a relatively simple example.
Theorem 6.8: In Gn,p , suppose that p = f (n), where f (n) = o(n−2/3 ). Then, for any
ε > 0 and for sufficiently large n, the probability that a random graph chosen from Gn,p
has a clique of four or more vertices is less than ε. Similarly, if f (n) = ω(n−2/3 ) then,
for sufficiently large n, the probability that a random graph chosen from Gn,p does not
have a clique with four or more vertices is less than ε.
Proof: We first consider the case in which p = f (n) and f (n) = o(n−2/3 ). Let
C1 , . . . , C(n ) be an enumeration of all the subsets of four vertices in G. Let
4

1 if Ci is a 4-clique,
Xi =
0 otherwise.
Let
(n4 )

X= Xi ,
i=1

so that

n
E[X] = p6 .
4
In this case E[X] = o(1), which means that E[X] < ε for sufficiently large n. Since X is
a nonnegative integer-valued random variable, it follows that Pr(X ≥ 1) ≤ E[X] < ε.
Hence, the probability that a random graph chosen from Gn,p has a clique of four or
more vertices is less than ε.
We now consider the case when p = f (n) and f (n) = ω(n−2/3 ). In this case,
E[X] → ∞ as n grows large. This in itself is not sufficient to conclude that, with high
probability, a graph chosen random from Gn,p has a clique of at least four vertices. We
can, however, use Theorem 6.7 to prove that Pr(X = 0) = o(1) in this case. To do so
we must show that Var[X] = o((E[X])2 ). Here we shall bound the variance directly;
an alternative approach is given as Exercise 6.12.
We begin with the following useful formula.

Lemma 6.9: Let Yi , i = 1, . . . , m, be 0–1 random variables, and let Y = mi=1 Yi . Then

Var[Y ] ≤ E[Y ] + Cov(Yi , Y j ).
1≤i, j≤m;i= j

144
6.6 the conditional expectation inequality

Proof: For any sequence of random variables Y1 , . . . , Ym ,

m
m
Var Yi = Var[Yi ] + Cov(Yi , Y j ).
i=1 i=1 1≤i, j≤m;i= j

This is the generalization of Theorem 3.2 to m variables.

When Yi is a 0–1 random variable, E[Yi2 ] = E[Yi ] and so

Var[Yi ] = E Yi2 − (E[Yi ])2 ≤ E[Yi ],
giving the lemma.

We wish to compute
⎡ n ⎤
(4 )
Var[X] = Var ⎣ Xi ⎦ .
i=1

Applying Lemma 6.9, we see that we need to consider the covariance of the Xi . If
|Ci ∩ C j | = 0 then the corresponding cliques are disjoint, and it follows that Xi and X j
are independent. Hence, in this case, E[Xi X j ] − E[Xi ]E[X j ] = 0. The same is true if
|Ci ∩ C j | = 1.
If |Ci ∩ C j | = 2, then the corresponding cliques share one edge. For both cliques to
be in the graph, the eleven corresponding edges must appear n in the graph. Hence, in
this case E[Xi X j ] − E[Xi ]E[X j ] ≤ E[Xi X j ] ≤ p . There are 6 ways to choose the six
11
6
vertices and 2;2;2 ways to split them into Ci and C j (because we choose two vertices
for Ci ∩ C j , two for Ci alone, and two for C j alone).
If |Ci ∩ C j | = 3, then the corresponding cliques share three edges. For both cliques
to be in the graph, the nine corresponding edges must appear in the graph. Hence, in
this case E[XiX j ] − E[Xi ]E[X j ] ≤ E[Xi X j ] ≤ p9 . There are n5 ways to choose the five
5
vertices, and 3;1;1 ways to split them
into Ci and C j .
Finally, recall again that E[X] = n4 p6 and p = f (n) = ω(n−2/3 ). Therefore,

n n 6 n 5
Var[X] ≤ p+
6
p +
11
p9 = o(n8 p12 ) = o((E[X])2 ),
4 6 2; 2; 2 5 3; 1; 1
since
2
n
(E[X]) =
2
p6 = (n8 p12 ).
4
Theorem 6.7 now applies, showing that Pr(X = 0) = o(1) and thus the second part of
the theorem.

6.6. The Conditional Expectation Inequality

For a sum of Bernoulli random variables, we can derive an alternative to the second
moment method that is often easier to apply.

145
the probabilistic method

n
Theorem 6.10: Let X = i=1 Xi , where each Xi is a 0–1 random variable. Then
n
Pr(Xi = 1)
Pr(X > 0) ≥ . (6.3)
i=1
E[X | Xi = 1]

Notice that the Xi need not be independent for Eqn. (6.3) to hold.
Proof: Let Y = 1/X if X > 0, with Y = 0 otherwise. Then
Pr(X > 0) = E[XY ].
However,
n

E[XY ] = E XiY
i=1

n
= E[XiY ]
i=1
n

= E[XiY | Xi = 1] Pr(Xi = 1) + E[XiY | Xi = 0] Pr(Xi = 0)
i=1
n
= E[Y | Xi = 1] Pr(Xi = 1)
i=1
n
= E[1/X | Xi = 1] Pr(Xi = 1)
i=1
n
Pr(Xi = 1)
≥ .
i=1
E[X | Xi = 1]

The key step is from the third to the fourth line, where we use conditional expectations
in a fruitful way by taking advantage of the fact that E[XiY | Xi = 0] = 0. The last line
makes use of Jensen’s inequality, with the convex function f (x) = 1/x.
We can use Theorem 6.10 to give an alternate proof of Theorem 6.8. Specifically, if
p = f (n) = ω(n−2/3 ), we use Theorem 6.10 to show that, for any constant ε > 0 and
for sufficiently large n, the probability that a random graph chosen from Gn,p does not
have a clique with four or more vertices is less than ε.
(n4 )
As in the proof of Theorem 6.8, let X = i=1 Xi , where Xi is 1 if the subset of four
vertices Ci is a 4-clique and 0 otherwise. For a specific X j , we have Pr(X j = 1) = p6 .
Using the linearity of expectations, we compute
⎡ n ⎤
(4 ) (n4 )

E[X | X j = 1] = E ⎣ Xi X j = 1⎦ = E[Xi | X j = 1].
i=1 i=1

Conditioning on X j = 1, we now compute E[Xi | X j = 1] by using that, for a 0–1 ran-

dom variable,
E[Xi | X j = 1] = Pr(Xi = 1 | X j = 1).
146
6.7 the lovász local lemma

There are n−44
sets of vertices Ci that do not intersect C j . Each corresponding
Xi
is 1 with probability p6 . Similarly, Xi = 1 with probability p6 for the 4 n−4
3
sets C i that
have one vertex in common with C j .
For the remaining cases, we have Pr(Xi = 1 | X j = 1) = p5 for the 6 n−4 2
sets
n−4Ci
that have two vertices in common with C j and Pr(Xi = 1 | X j = 1) = p for the 4 1
3

sets Ci that have three vertices in common with C j . Summing, we have

(n4 )

E[X | X j = 1] = E[Xi | X j = 1]
i=1

n−4 n−4 n−4 n−4
=1+ p +46
p +6
6
p +4
5
p3 .
4 3 2 1

Applying Theorem 6.10 yields

n
p6
Pr(X > 0) ≥ 4
,
1+ n−4
4
p6 + 4 n−4
3
p6 + 6 n−4
2
p5 + 4 n−4
1
p3

which approaches 1 as n grows large when p = f (n) = ω(n−2/3 ).

6.7. The Lovász Local Lemma

One of the most elegant and useful tools in applying the probabilistic method is the
Lovász Local Lemma. Let E1 , . . . , En be a set of bad events in some probability space.
We want to show that there is an element in the sample space that is not included in
any of the bad events.
This would be easy to do if the events were mutually independent. Recall
that events E1 , E2 , . . . , En are mutually independent if and only if, for any subset
I ⊆ [1, n],

Pr Ei = Pr(Ei ).
i∈I i∈I

Also, if E1 , . . . , En are mutually independent then so are Ē1 , . . . , Ēn . (This was left as
Exercise 1.20.) If Pr(Ei ) < 1 for all i, then
n

n
Pr Ēi = Pr(Ēi ) > 0,
i=1 i=1

and there is an element of the sample space that is not included in any bad event.
Mutual independence is too much to ask for in many arguments. The Lovász local
lemma generalizes the preceding argument to the case where the n events are not mutu-
ally independent but the dependency is limited. Specifically, following from the defi-
nition of mutual independence, we say that an event En+1 is mutually independent of
147
the probabilistic method

the events E1 , E2 , . . . , En if, for any subset I ⊆ [1, n],

⎛ ⎞

Pr ⎝En+1 E j ⎠ = Pr(En+1 ).
j∈I

The dependency between events can be represented in terms of a dependency graph.

Definition 6.1: A dependency graph for a set of events E1 , . . . , En is a graph G =
(V, E ) such that V = {1, . . . , n} and, for i = 1, . . . , n, event Ei is mutually independent
of the events {E j | (i, j) ∈
/ E}. The degree of the dependency graph is the maximum
degree of any vertex in the graph.
We discuss first a special case, the symmetric version of the Lovász Local Lemma,
which is more intuitive and is sufficient for most algorithmic applications.
Theorem 6.11 [Lovász Local Lemma]: Let E1 , . . . , En be a set of events, and assume
that the following hold:
1. for all i, Pr(Ei ) ≤ p;
2. the degree of the dependency graph given by E1 , . . . , En is bounded by d;
3. 4d p ≤ 1.
Then
n

Pr Ēi > 0.
i=1

Proof: Let S ⊂ {1, . . . , n}. We prove by induction on s = 0, . . . , n − 1 that, if |S| ≤ s,

then for all k ∈
/ S we have
⎛ ⎞

Pr ⎝Ek Ē j ⎠ ≤ 2p.
j∈S

For this expression to be well-defined when S is not empty, we need Pr j∈S Ē j > 0.
The base case s = 0 follows from the assumption
that Pr(Ek ) ≤ p. To perform the
inductive step, we first show that Pr Ē
j∈S j > 0. This is true when s = 1, because
Pr(Ē j ) ≥ 1 − p > 0. For s > 1, without loss of generality let S = {1, 2, . . . , s}. Then
s ⎛ ⎞

s

i−1
Pr Ēi = Pr ⎝Ēi Ē j ⎠
i=1 i=1 j=1
⎛ ⎛ ⎞⎞

s

i−1
= ⎝1 − Pr ⎝Ei Ē j ⎠⎠
i=1 j=1

s
≥ (1 − 2p) > 0.
i=1

In obtaining the last line we used the induction hypothesis.

148
6.7 the lovász local lemma

For the rest of the induction, let S1 = { j ∈ S | (k, j) ∈ E} and S2 = S − S1 . If S2 = S

then Ek is mutually independent of the events Ēi , i ∈ S, and
⎛ ⎞

Pr ⎝Ek Ē j ⎠ = Pr(Ek ) ≤ p.
j∈S

We continue with the case |S2 | < s. It will be helpful to introduce the following nota-
tion. Let FS be defined by

FS = Ē j ,
j∈S

and similarly define FS1 and FS2 . Notice that FS = FS1 ∩ FS2 .
Applying the definition of conditional probability yields

Pr(Ek ∩ FS )
Pr(Ek | FS ) = . (6.4)
Pr(FS )

Applying the definition of conditional probability to the numerator of Eqn. (6.4), we

obtain

Pr(Ek ∩ FS ) = Pr(Ek ∩ FS1 ∩ FS2 )

= Pr(Ek ∩ FS1 | FS2 ) Pr(FS2 ).

The denominator can be written as

Pr(FS ) = Pr(FS1 ∩ FS2 )

= Pr(FS1 | FS2 ) Pr(FS2 ).

Canceling the common factor, which we have already shown to be nonzero, yields

Pr(Ek ∩ FS1 | FS2 )

Pr(Ek | FS ) = . (6.5)
Pr(FS1 | FS2 )

Note that Eqn. (6.5) is valid even when S2 = ∅.

Since the probability of an intersection of events is bounded by the probability of
any one of the events and since Ek is independent of the events in S2 , we can bound the
numerator of Eqn. (6.5) by

Pr(Ek ∩ FS1 | FS2 ) ≤ Pr(Ek | FS2 ) = Pr(Ek ) ≤ p.

Because |S2 | < |S| = s, we can apply the induction hypothesis to

⎛ ⎞

Pr(Ei | FS2 ) = Pr ⎝Ei Ē j ⎠ .

j∈S2

149
the probabilistic method

Using also the fact that |S1 | ≤ d, we establish a lower bound on the denominator of
Eqn. (6.5) as follows:
⎛ ⎞

Pr(FS1 | FS2 ) = Pr ⎝ Ēi Ē j ⎠

i∈S1 j∈S2
⎛ ⎞

≥ 1− Pr ⎝Ei Ē j ⎠
i∈S1 j∈S2

≥ 1− 2p
i∈S1
≥ 1 − 2pd
1
≥ .
2
Using the upper bound for the numerator and the lower bound for the denominator,
we prove the induction:
Pr(Ek ∩ FS1 | FS2 )
Pr(Ek | FS ) =
Pr(FS1 | FS2 )
p
≤ = 2p.
1/2
The theorem follows from
n ⎛ ⎞

n

i−1
Pr Ēi = Pr ⎝Ēi Ē j ⎠
i=1 i=1 j=1
⎛ ⎛ ⎞⎞

n

i−1
= ⎝1 − Pr ⎝Ei Ē j ⎠⎠
i=1 j=1

n
≥ (1 − 2p) > 0.
i=1

6.7.1. Application: Edge-Disjoint Paths

Assume that n pairs of users need to communicate using edge-disjoint paths on a given
network. Each pair i = 1, . . . , n can choose a path from a collection Fi of m paths. We
show using the Lovász local lemma that, if the possible paths do not share too many
edges, then there is a way to choose n edge-disjoint paths connecting the n pairs.
Theorem 6.12: If any path in Fi shares edges with no more than k paths in Fj , where
i = j and 8nk/m ≤ 1, then there is a way to choose n edge-disjoint paths connecting
the n pairs.
Proof: Consider the probability space defined by each pair choosing a path indepen-
dently and uniformly at random from its set of m paths. Define Ei, j to represent the
150
6.7 the lovász local lemma

event that the paths chosen by pairs i and j share at least one edge. Since a path in Fi
shares edges with no more than k paths in Fj ,
k
p = Pr(Ei, j ) ≤ .
m
Let d be the degree of the dependency graph. Since event Ei, j is independent of all
events Ei , j when i ∈
/ {i, j} and j ∈
/ {i, j}, we have d < 2n. Since
8nk
4dp < ≤ 1,
m
all of the conditions of the Lovász Local Lemma are satisfied, proving
⎛ ⎞

Pr ⎝ Ēi, j ⎠ > 0.
i= j

Hence, there is a choice of paths such that the n paths are edge disjoint.

6.7.2. Application: Satisfiability

As a second example, we return to the satisfiability question. For the k-satisfiability (k-
SAT) problem, the formula is restricted so that each clause has exactly k literals. Again,
we assume that no clause contains both a literal and its negation, as these clauses are
trivial. We prove that any k-SAT formula in which no variable appears in too many
clauses has a satisfying assignment.
Theorem 6.13: If no variable in a k-SAT formula appears in more than T = 2k /4k
clauses, then the formula has a satisfying assignment.
Proof: Consider the probability space defined by giving a random assignment to the
variables. For i = 1, . . . , m, let Ei denote the event that the ith clause is not satisfied by
the random assignment. Since each clause has k literals,
Pr(Ei ) = 2−k .
The event Ei is mutually independent of all of the events related to clauses that do
not share variables with clause i. Because each of the k variables in clause i can appear
in no more than T = 2k /4k clauses, the degree of the dependency graph is bounded by
d ≤ kT ≤ 2k−2 .
In this case,
4dp ≤ 4 · 2k−2 2−k ≤ 1,
so we can apply the Lovász Local Lemma to conclude that
m

Pr Ēi > 0;
i=1

hence there is a satisfying assignment for the formula.

151
the probabilistic method

6.8.∗ Explicit Constructions Using the Local Lemma

The Lovász Local Lemma proves that a random element in an appropriately defined
sample space has a nonzero probability of satisfying our requirement. However, this
probability might be too small for an algorithm that is based on simple sampling. The
number of objects that we need to sample before we find an element that satisfies our
requirements might be exponential in the problem size.
In a number of interesting applications, the existential result of the Lovász Local
Lemma can be used to derive efficient construction algorithms. Although the details
differ in the specific applications, many known algorithms are based on a common two-
phase scheme. In the first phase, a subset of the variables of the problem are assigned
random values; the remaining variables are deferred to the second stage. The subset of
variables that are assigned values in the first stage is chosen so that:

1. using the Local Lemma, one can show that the random partial solution fixed in the
first phase can be extended to a full solution of the problem without modifying any
of the variables fixed in the first phase; and
2. the dependency graph H between events defined by the variables deferred to the
second phase has, with high probability, only small connected components.

When the dependency graph consists of connected components, a solution for the
variables of one component can be found independently of the other components. Thus,
the first phase of the two-phase algorithm breaks the original problem into smaller
subproblems. Each of the smaller subproblems can then be solved independently in the
second phase by an exhaustive search.

6.8.1. Application: A Satisfiability Algorithm

We demonstrate this technique in an algorithm for finding a satisfying assignment for
a k-SAT formula. The explicit construction result will be significantly weaker than the
existence result proven in the previous section. In particular, we obtain a polynomial
time algorithm only for the case when k is a constant. This result is still interesting, since
for k ≥ 3 the problem of k-satisfiability is NP-complete. For notational convenience we
treat here only the case where k is an even constant with k ≥ 12; the case where k is a
sufficiently large odd constant is similar.
Consider a k-SAT formula F, with k an even constant, such that each variable
appears in no more than T = 2αk clauses for some constant α > 0 determined in the
proof. Let x1 , . . . , x be the variables and C1 , . . . , Cm the m clauses of F.
Following the outline suggested in Section 6.8, our algorithm for finding a satisfying
assignment for F has two phases. Some of the variables are fixed at the first phase,
and the remaining variables are deferred to the second phase. While executing the first
phase, we call a clause Ci dangerous if both the following conditions hold:

1. k/2 literals of the clause Ci have been fixed; and

2. Ci is not yet satisfied.

152
∗
6.8 explicit constructions using the local lemma

Phase I can be described as follows. Consider the variables x1 , . . . , x sequentially.

If xi is not in a dangerous clause, assign it independently and uniformly at random a
value in {0, 1}.
A clause is a surviving clause if it is not satisfied by the variables fixed in phase I.
Note that a surviving clause has no more than k/2 of its variables fixed in the first
phase. A deferred variable is a variable that was not assigned a value in the first phase.
In phase II, we use exhaustive search in order to assign values to the deferred variables
and so complete a satisfying assignment for the formula.
In the next two lemmas we show that:
1. the partial solution computed in phase I can be extended to a full satisfying assign-
ment of F, and
2. with high probability, the exhaustive search in phase II is completed in time that is
polynomial in m.
Lemma 6.14: There is an assignment of values to the deferred variables such that all
the surviving clauses are satisfied.
Proof: Let H = (V, E ) be a graph on m nodes, where V = {1, . . . , m}, and let (i, j) ∈
E if and only if Ci ∩ C j = ∅. That is, H is the dependency graph for the original prob-
lem. Let H = (V , E ) be a graph with V ⊆ V and E ⊆ E such that (a) i ∈ V if and
only if Ci is a surviving clause and (b) (i, j) ∈ E if and only if Ci and C j share a deferred
variable. In the following discussion we do not distinguish between node i and clause i.
Consider the probability space defined by assigning a random value in {0, 1} inde-
pendently to each deferred variable. The assignment of values to the nondeferred vari-
ables in phase I, together with the random assignment of values to the deferred vari-
ables, defines an assignment to all the variables. For i = 1, . . . , m, let Ei be the event
that surviving clause Ci is not satisfied by this assignment. Associate the event Ei with
node i in V . The graph H is then the dependency graph for this set of events.
A surviving clause has at least k/2 deferred variables, so
p = Pr(Ei ) ≤ 2−k/2 .
A variable appears in no more than T clauses; therefore, the degree of the dependency
graph is bounded by
d = kT ≤ k2αk .
For any k ≥ 12, there is a corresponding suitably small constant α > 0 so that
4dp = 4k2αk 2−k/2 ≤ 1
and so, by the Lovász Local Lemma, there is an assignment for the deferred variables
that – together with the assignment of values to variables in phase I – satisfies the
formula.
The assignment of values to a subset of the variables in phase I partitions the problem
into as many as m independent subformulas, so that each deferred variable appears
in only one subformula. The subformulas are given by the connected components of
H . If we can show that each connected component in H has size O(log m), then each
153
the probabilistic method

subformula will have no more than O(k log m) deferred variables. An exhaustive search
of all the possible assignments for all variables in each subformula can then be done in
polynomial time. Hence we focus on the following lemma.
Lemma 6.15: All connected components in H are of size O(log m) with probability
1 − o(1).
Proof: Consider a connected component R of r vertices in H. If R is a connected com-
ponent in H , then all its r nodes are surviving clauses. A surviving clause is either
a dangerous clause or it shares at least one deferred variable with a dangerous clause
(i.e., it has a neighbor in H that is a dangerous clause). The probability that a given
clause is dangerous is at most 2−k/2 , since exactly k/2 of its variables were given ran-
dom values in phase I yet none of these values satisfied the clause. The probability that
a given clause survives is the probability that either this clause or at least one of its
direct neighbors is dangerous, which is bounded by
(d + 1)2−k/2 ,
where again d = kT > 1.
If the survival of individual clauses were independent events then we would be in
excellent shape. However, from our description here it is evident that such events are
not independent. Instead, we identify a subset of the vertices in R such that the survival
of the clauses represented by the vertices of this subset are independent events. A 4-tree
S of a connected component R in H is defined as follows:
1. S is a rooted tree;
2. any two nodes in S are at distance at least 4 in H;
3. there can be an edge in S only between two nodes with distance exactly 4 between
them in H;
4. any node of R is either in S or is at distance 3 or less from a node in S.
Considering the nodes in a 4-tree proves useful because the event that a node u in
a 4-tree survives and the event that another node v in a 4-tree survives are actually
independent. Any clause that could cause u to survive has distance at least 2 from
any clause that could cause v to survive. Clauses at distance 2 share no variables, and
hence the events that they are dangerous are independent. We can take advantage of
this independence to conclude that, for any 4-tree S, the probability that the nodes in
the 4-tree survive is at most
((d + 1)2−k/2 )|S| .
A maximal 4-tree S of a connected component R is the 4-tree with the largest possible
number of vertices. Since the degree of the dependency graph is bounded by d, there
are no more than
d + d(d − 1) + d(d − 1)(d − 1) ≤ d 3 − 1
nodes at distance 3 or less from any given vertex. We therefore claim that a maximal
4-tree of R must have at least r/d 3 vertices. Otherwise, when we consider the vertices
of the maximal 4-tree S and all neighbors within distance 3 or less of these vertices,
we obtain fewer than r vertices. Hence there must be a vertex of distance at least 4
154
6.9 lovász local lemma: the general case

from all vertices in S. If this vertex has distance exactly 4 from some vertex in S, then
it can be added to S and thus S is not maximal, yielding a contradiction. If its dis-
tance is larger than 4 from all vertices in S, consider any path that brings it closer to S;
such a path must eventually pass through a vertex of distance at least 4 from all ver-
tices in S and of distance 4 from some vertex in S, again contradicting the maximality
of S.
To show that with probability 1 − o(1) there is no connected component R of size
r ≥ c log2 m for some constant c in H , we show that there is no 4-tree of H of size
r/d 3 that survives with probability 1 − o(1). Since a surviving connected component
R would have a maximal 4-tree of size r/d 3 , the absence of such a 4-tree implies the
absence of such a component.
We need to count the number of 4-trees of size s = r/d 3 in H. We can choose the
root of the 4-tree in m ways. A tree with root v is uniquely defined by an Eulerian tour
that starts and ends at v and traverses each edge of the tree twice, once in each direction.
Since an edge of S represents a path of length 4 in H, at each vertex in the 4-tree the
Eulerian path can continue in as many as d 4 different ways, and therefore the number
of 4-trees of size s = r/d 3 in H is bounded by
3
m(d 4 )2s = md 8r/d .
The probability that the nodes of each such 4-tree survive in H is at most
((d + 1)2−k/2 )s = ((d + 1)2−k/2 )r/d .
3

Hence the probability that H has a connected component of size r is bounded by

md 8r/d ((d + 1)2−k/2 )r/d ≤ m2(rk/d
3 3 3
)(8α+2α−1/2)
= o(1)
for r ≥ c log2 m and for a suitably large constant c and a sufficiently small constant
α > 0.

Thus, we have the following theorem.

Theorem 6.16: Consider a k-SAT formula with m clauses, where k ≥ 12 is an even
constant and each variable appears in up to 2αk clauses for a sufficiently small constant
α > 0. Then there is an algorithm that finds a satisfying assignment for the formula in
expected time that is polynomial in m.
Proof: As we have described, if the first phase partitions the problem into subformu-
las involving only O(k log m) variables, then a solution can be found by solving each
subformula exhaustively in time that is polynomial in m. The probability of the first
phase partitioning the problem appropriately is 1 − o(1), so we need only run phase I
a constant number of times on average before obtaining a good partition. The theorem
follows.

6.9. Lovász Local Lemma: The General Case

For completeness we include the statement and proof of the general case of the Lovász
Local Lemma.
155
the probabilistic method

Theorem 6.17: Let E1 , . . . , En be a set of events in an arbitrary probability space,

and let G = (V, E ) be the dependency graph for these events. Assume there exist
x1 , . . . , xn ∈ [0, 1] such that, for all 1 ≤ i ≤ n,

Pr(Ei ) ≤ xi (1 − x j ).
(i, j)∈E

Then
n

n
Pr Ēi ≥ (1 − xi ).
i=1 i=1

Proof: Let S ⊆ {1, . . . , n}. We prove by induction on s = 0, . . . , n that, if |S| ≤ s, then

for all k we have
⎛ ⎞

Pr ⎝Ek Ē j ⎠ ≤ xk .
j∈S

As in the case of the symmetric version of the Local Lemma, we must be careful that
the conditional probability is well-defined. This follows using the same approach as in
the symmetric case, so we focus on the rest of the induction.
The base case s = 0 follows from the assumption that

Pr(Ek ) ≤ xk (1 − x j ) ≤ xk .
(k, j)∈E

For the inductive step, let S1 = { j ∈ S | (k, j) ∈ E} and S2 = S − S1 . If S2 = S then

Ek is mutually independent of the events Ēi , i ∈ S, and
⎛ ⎞

Pr ⎝Ek Ē j ⎠ = Pr(Ek ) ≤ xk .
j∈S

We continue with the case |S2 | < s. We again use the notation

FS = Ē j
j∈S

and define FS1 and FS2 similarly, so that FS = FS1 ∩ FS2 .

Applying the definition of conditional probability yields
Pr(Ek ∩ FS )
Pr(Ek | FS ) = . (6.6)
Pr(FS )
By once again applying the definition of conditional probability, the numerator of
Eqn. (6.6) can be written as
Pr(Ek ∩ FS ) = Pr(Ek ∩ FS1 | FS2 ) Pr(FS2 )
and the denominator as
Pr(FS ) = Pr(FS1 | FS2 ) Pr(FS2 ).
156
6.9 lovász local lemma: the general case

Canceling the common factor then yields

Pr(Ek ∩ FS1 | FS2 )
Pr(Ek | FS ) = . (6.7)
Pr(FS1 | FS2 )
Since the probability of an intersection of events is bounded by the probability of
each of the events and since Ek is independent of the events in S2 , we can bound the
numerator of Eqn. (6.7) by

Pr(Ek ∩ FS1 | FS2 ) ≤ Pr(Ek | FS2 ) = Pr(Ek ) ≤ xk (1 − x j ).
(k, j)∈E

To bound the denominator of Eqn. (6.7), let S1 = { j1 , . . . , jr }. Applying the induc-

tion hypothesis, we have
⎛ ⎞

Pr(FS1 | FS2 ) = Pr ⎝ Ē j Ē j ⎠
j∈S1 j∈S2
⎛ ⎛ i−1 ⎛ ⎞⎞⎞

r

= ⎝1 − Pr ⎝E ji Ē jt ∩ ⎝ Ē j ⎠⎠⎠
i=1 t=1 j∈S2

r
≥ (1 − x ji )
i=1

≥ (1 − x j ).
(k, j)∈E

Using the upper bound for the numerator and the lower bound for the denominator,
we can prove the induction hypothesis:
⎛ ⎞

Pr ⎝Ek Ē j ⎠ = Pr(Ek | FS )
j∈S

Pr(Ek ∩ FS1 | FS2 )

=
Pr(F | F )
S1 S2
xk (k, j)∈E (1 − x j )
≤
(k, j)∈E (1 − x j )
= xk .

The theorem now follows from:

n
Pr(Ē1 , . . . , Ēn ) = Pr(Ēi | Ē1 , . . . , Ēi−1 )
i=1
n
= (1 − Pr(Ei | Ē1 , . . . , Ēi−1 ))
i=1
n
≥ (1 − xi ) > 0.
i=1

157
the probabilistic method

6.10.∗ The Algorithmic Lovász Local Lemma

Recently, there have been several advances in extending the Lovász Local Lemma. We
briefly summarize the key points here, and start by looking again to the k-SAT problem
to provide an example of these ideas in action.
We have shown previously that if no variable in a k-SAT formula appears in more
than 2k /(4k) clauses, then the formula has a satisfying assignment, and we have shown
that if each variable appears in no more that 2αk clauses for some constant α a solution
can be found in expected polynomial time. Here we provide an improved result which
again provides a solution in expected polynomial time.

Theorem 6.18: Suppose that every clause in a k-SAT formula shares one or more
variables with at most 2k−3 − 1 other clauses. Then a solution for the formula exists
and can be found in expected time that is polynomial in the number of clauses m.

Before starting the proof, we informally describe our algorithm. As before, let
x1 , x2 , . . . , x be the variables and C1 , C2 , . . . , Cm be the m clauses in the formula.
We begin by choosing a random truth assignment (uniformly at random). We then look
for a clause Ci that is unsatisfied; if no such clause exists we are done. If such a clause
exists, we look specifically at the variables in the clause, and randomly choose a new
truth assignment for those variables. Doing so will hopefully “fix” the clause Ci so that
it is satisfied, but it may not; even worse, it may end up causing a clause C j that shares
a variable with Ci to become unsatisfied. We recursively fix these neighboring clauses,
so that when the recursion is finished, we have that Ci is satisfied and we have not
damaged any clause by making it become unsatisfied. We therefore have improved the
situation by satisfying at least one previously unsatisfied clause. We then continue to
the next unsatisfied clause; we have to do this at most m times.
The underlying question that we need to answer to show that this algorithm works
is how we know that the recursion we have described stops successfully. Perhaps it
simply goes on forever, or for an exponential amount of time. The proof we provide
shows that this cannot be the case through a new type of argument. Specifically, we
show that if such bad recursions occur with non-trivial probability, then one could
compress a random string of n independent, unbiased flips into much fewer than n bits.
That should seem impossible, and it is. While compression is a theme we cover in much
more detail in Chapter 10, we explain the compression result we need here in the proof
of the theorem. All we need is that a string of r random bits, where each bit is chosen
independently and uniformly at random, cannot be compressed so that the average
length of the representation over all choices of the r random bits is less than r − 2.
To see that this must be true, assume the best possible setting for us, where we don’t
have to worry about the “end” of our compressed sequence, but can use each string of
bits of length less than r to represent one of the 2r possible strings we aim to compress.
That is, we won’t worry that one compressed string might be “0” and another one
might be “00”, in which case it might be hard to distinguish whether “00” was meant
to represent a single compressed string, or two copies of the string represented by “0”.
(Essentially, a compressed string can be terminated for free; this allowance can only
158
∗
6.10 the algorithmic lovász local lemma

hurt us in our argument.) Still, each string of s < r bits can only represent a single
possible string of length r. Hence we have available one string of length 0 (the empty
string), two strings of length 1, and so on. There are only 2r − 1 strings of length less
than r; even if we count only those in computing the average length of the compressed
string, which again can only hurt us, the average length would be at least

r−1
1
r−i
· i ≥ r − 2.
i=1
2
The same compression fact naturally holds true for any collection of 2r equally likely
strings; they do not have to be limited to strings of r random bits.
Given this fact, our proof proceeds as follows.

Proof of Theorem 6.18: The algorithm we use is explicitly provided as the k-

Satisfiability Algorithm below.

k-Satisfiability Algorithm (Using Algorithmic LLL):

Input: A collection C1 , C2 , . . . , Cm of clauses for a k-SAT formula over n variables.

Output: A truth assignment for these variables.
Main routine:
1. Start with a random assignment.
2. While some Ci is not satisfied:
(a) choose the unsatisfied Ci with the smallest value of i;
(b) enter i in binary using log2 m bits in the history;
(c) use localcorrect on clause Ci .
localcorrect(C):
1. Resample new values for every variable in clause C.
2. While some C j that shares a variable with C (including possibly C itself) is not
satisfied
(a) choose the unsatisfied C j sharing a variable with C with the smallest value
of j;
(b) enter “0” followed by j in binary using k − 3 bits in the history;
(c) use localcorrect on clause C j .
3. Enter “1” in the history.

We note the algorithm produces a history, which we use in the analysis of the algo-
rithm.
It is important to realize that while a clause can become satisfied and unsatisfied
again multiple times through the recursive process, when we return to the main routine
and complete the call to localcorrect, we have satisfied the clause Ci that localcorrect
was called on from the main routine, and further any clause that was previously satisfied
has stayed satisfied because of the recursion. What we wish to show is that the recursive
process has to stop.
159
the probabilistic method

Our analysis makes use of the fact that our algorithm makes use of a random string.
We provide two different ways to describe how our algorithm runs.
We can think of our algorithm as being described by the random string of bits it uses.
It takes n bits to initially assign random truth values to each of the variables. After that,
it takes k bits to resample values for a clause each time localcorrect is called. Let us refer
to each time localcorrect is called as a round. Then one way to describe our algorithm’s
actions for j rounds is with the random string of n + jk bits used by the algorithm.
But here is another way of describing how our algorithm works. We keep track of the
“history” of the algorithm as shown in the algorithm. The history includes a list of the
clauses that localcorrect is called on by the main routine. The history also includes a list
of the recursive calls to localcorrect, in a slightly non-obvious way. First, we note that
the algorithm uses a flag bit 0 and a flag bit 1 to mark the start and end of recursive calls,
so the algorithm tracks the stack of recursive calls in a natural way. Second, instead of
the natural approach of using log2 m bits to represent the index of the clause in our
recursive calls, the algorithm uses only k − 3 bits. We now explain why only k − 3 bits
are needed. Since there are at most 2k−3 possible clauses that share a variable with the
current clause (including the current clause itself) that could be the next one called, the
clause can be represented by an index of k − 3 bits. (Imagine having an ordered list of
the up to 2k−3 clauses that share a variable with each clause; we just need the index into
that list.) Finally, our history will also include the current truth assignment of n bits.
Note that the current truth assignment can be thought of as in an separate updatable
storage area for the history; every time the truth assignment is updated, so is this part
of the history.
We now show that when the algorithm has run j rounds, we can recover the random
string of n + jk bits that the algorithm has used from the history we have described.
Start with the current truth assignment, and break the history up, using the flags that
mark invocations of localcorrect. We can use the history to determine the sequence of
recursive calls, and what clauses localcorrect was called on. Then, going backwards
through the history, we know at each step which clause was being resampled. For that
clause to have to be resampled, it must have been unsatisfied previously. But there is
only one setting of the variables that makes a clause unsatisfied, and hence we know
what the truth values for those variables were before the clause was resampled. We
can therefore update the current truth assignment so that it represents the truth assign-
ment before the resampling, and continue backwards through the process. Repeating
this action, we can determine the original truth assignment, and since at each step we
can determine what variable values were changed and what their values were on each
resampling, we recover the whole string of n + jk random bits.
Our history takes at most n + mlog2 m + j(k − 1) bits; here we use the fact that
each resampling uses at most k − 1 bits, including the two bits that may be necessary as
flags for the start and end of the recursion given by that resampling. For large enough j,
our history yields a compressed form of the the random string used to run the algorithm,
since only k − 1 bits are used to represent each resampling in the history instead of the
k bits used by the algorithm.
Now suppose there were no truth assignment, in which case the algorithm would
run forever. Then after a large enough number of rounds J, the history will be at most
160
∗
6.10 the algorithmic lovász local lemma

n + mlog2 m + J(k − 1) bits, while the random string running the algorithm would
be n + Jk bits. By our result on compressing random strings, we must have
n + mlog2 m + J(k − 1) ≥ n + Jk − 2.
Hence
J ≤ mlog2 m + 2.
This contradicts that the algorithm can run forever, so there must be a truth assignment.
Similarly, the number of rounds J is more than mlog2 m + 2 + i with probability
at most 2−i . To see this, suppose the probability of lasting to this round is greater than
2−i . Again consider the algorithm after J = mlog2 m + 2 + i rounds, so the history
will be at most n + mlog2 m + J(k − 1) bits. The algorithm can also be described
by the n + Jk random bits that led to the current state. As there are at least 2n+Jk ran-
dom bit strings of this length, and the probability of lasting at least this many rounds
is greater than 2−i by assumption, there are at least 2n+Jk−i random bit strings associ-
ated with reaching this round. By our result on compressing random strings, it requires
more than n + Jk − i − 2 bits on average to represent the at least 2n+Jk−i random bit
strings associated with reaching this round. But the history, as we have already argued,
provides a representation of these random bit strings, in that we can reconstruct the
algorithm’s random bit string from the history. The number of bits the history uses is
only
n + mlog2 m + J(k − 1) = n + Jk − i − 2,
a contradiction.
Since the probability of lasting more than mlog2 m + 2 + i is at most 2−i , we can
bound the expected number of rounds by
∞

mlog2 m + 2 + i2−i .
i=1

The expected number rounds used by the algorithm is thus at most mlog2 m + 4.
The work done in each resampling round can easily be made to be polynomial in
m, so the total expected time to find an assigment can be made polynomial in m as
well.
While already surprising, the proof above can be improved slightly. A more careful
encoding shows that the expected number of rounds required can be reduced to O(m)
instead of O(m log m). This is covered in the Exercise 6.21.
The algorithmic approach we have used for the satisfiability problem in the proof of
Theorem 6.18 can be extended further to obtain an algorithmic version of the Lovász
local lemma, which we now describe. Let us suppose that we have a collection of n
events E1 , E2 , . . . , En that depend on a collection of mutually independent variables
y1 , y2 , . . . , y . The dependency graph on events has an edge between two events if they
both depend on at least one shared variable yi . The idea is that at each step if there
is an event that is unsatisfied, we resample only the random variables on which that
event depends. As with the k-Satisfiability Algorithm using the algorithmic Lovász
161
the probabilistic method

Local Lemma, this resampling process has to be ordered carefully to ensure progress.
If the dependencies are not too great, then the right resampling algorithm terminates
with a solution.
The symmetric version is easier to state.
Theorem 6.19: Let E1 , E2 , . . . , En be a set of events in an arbitrary probability space
that are determined by mutually independent random variables y1 , y2 , . . . , y , and let
G = (V, E ) be the dependency graph for these events. Suppose the following conditions
hold for values d and p:
1. each event Ei is adjacent to at most d other events in the dependency graph, or
equivalently, there are only d other events that also depend on one or more of the
y j that Ei depends on;
2. Pr(Ei ) ≤ p;
3. ep(d + 1) ≤ 1.
Then there exists an assignment of the yi so that the event ∩ni=1 Ēi holds, and a resam-
pling algorithm with the property that the expected number of times the algorithm
resamples the event Ei in finding such an assignment is at most 1/d. Hence the expected
total number of resampling steps taken by the algorithm is at most n/d.
However, we also have a corresponding theorem for the asymmetric version.
Theorem 6.20: Let E1 , E2 , . . . , En be a set of events in an arbitrary probability
space that are determined by mutually independent random variables y1 , y2 , . . . , y ,
and let G = (V, E ) be the dependency graph for these events. Assume there exist
x1 , x2 , . . . , xn ∈ [0, 1] such that, for all 1 ≤ i ≤ n

Pr(Ei ) ≤ xi (1 − x j ).
(i, j)∈E

Then there exists an assignment of the yi so that the event ∩ni=1 Ēi holds, and a resam-
pling algorithm with the property that the expected number of times the algorithm
resamples the event Ei in finding such an assignment is at most xi /(1 − xi ). Hence
the
n expected total number of resampling steps taken by the algorithm is at most
i=1 xi /(1 − xi ).

The proofs of these theorems are beyond the scope of the book. Similar to the algo-
rithm for satisfiability based on resampling given above, the proof relies on bounding
the expected number of resamplings that occur over the course of the algorithm.

6.11. Exercises

Exercise 6.1: Consider an instance of SAT with m clauses, where every clause has
exactly k literals.
(a) Give a Las Vegas algorithm that finds an assignment satisfying at least m(1 − 2−k )
clauses, and analyze its expected running time.
162
6.11 exercises

(b) Give a derandomization of the randomized algorithm using the method of condi-
tional expectations.

Exercise 6.2:
(a) Prove that, for every integer n, there exists a coloring of the edges of the complete
graph Kn by
two colors so that the total number of monochromatic copies of K4 is
at most n4 2−5 .
(b) Give a randomized algorithm for finding a coloring with at most n4 2−5 mono-
chromatic copies of K4 that runs in expected time polynomial in n.
(c) Show how to construct such a coloring deterministically in polynomial time using
the method of conditional expectations.

Exercise 6.3: Given an n-vertex undirected graph G = (V, E ), consider the following
method of generating an independent set. Given a permutation σ of the vertices, define
a subset S(σ ) of the vertices as follows: for each vertex i, i ∈ S(σ ) if and only if no
neighbor j of i precedes i in the permutation σ .
(a) Show that each S(σ ) is an independent set in G.
(b) Suggest a natural randomized algorithm to produce σ for which you can show that
the expected cardinality of S(σ ) is

n
1
,
i=1
di + 1
where di denotes the degree of vertex i.
(c) Prove that G has an independent set of size at least ni=1 1/(di + 1).

Exercise 6.4: Consider the following two-player game. The game begins with k tokens
placed at the number 0 on the integer number line spanning [0, n]. Each round, one
player, called the chooser, selects two disjoint and nonempty sets of tokens A and B.
(The sets A and B need not cover all the remaining tokens; they only need to be disjoint.)
The second player, called the remover, takes all the tokens from one of the sets off the
board. The tokens from the other set all move up one space on the number line from
their current position. The chooser wins if any token ever reaches n. The remover wins
if the chooser finishes with one token that has not reached n.
(a) Give a winning strategy for the chooser when k ≥ 2n .
(b) Use the probabilistic method to show that there must exist a winning strategy for
the remover when k < 2n .
(c) Explain how to use the method of conditional expectations to derandomize the
winning strategy for the remover when k < 2n .

Exercise 6.5: We have shown using the probabilistic method that, if a graph G has n
nodes and m edges, then there exists a partition of the n nodes into sets A and B such
that at least m/2 edges cross the partition. Improve this result slightly: show that there
exists a partition such that at least mn/(2n − 1) edges cross the partition.
163
the probabilistic method

Exercise 6.6: We can generalize the problem of finding a large cut to finding a large
k-cut. A k-cut is a partition of the vertices into k disjoint sets, and the value of a cut is
the weight of all edges crossing from one of the k sets to another. In Section 6.2.1 we
considered 2-cuts when all edges had the same weight 1, showing via the probabilistic
method that any graph G with m edges has a cut with value at least m/2. Generalize
this argument to show that any graph G with m edges has a k-cut with value at least
(k − 1)m/k. Show how to use derandomization (following the argument of Section 6.3)
to give a deterministic algorithm for finding such a cut.

Exercise 6.7: A hypergraph H is a pair of sets (V, E ), where V is the set of vertices
and E is the set of hyperedges. Every hyperedge in E is a subset of V. In particular, an
r-uniform hypergraph is one where the size of each edge is r. For example, a 2-uniform
hypergraph is just a standard graph. A dominating set in a hypergraph H is a set of
vertices S ⊂ V such that e ∩ S = ∅ for every edge e ∈ E. That is, S hits every edge of
the hypergraph.
Let H = (V, E ) be an r-uniform hypergraph with n vertices and m edges. Show
that there is a dominating set of size at most np + (1 − p)r m for every real number
0 ≤ p ≤ 1. Also, show that there is a dominating set of size at most (m + n ln r)/r.

Exercise 6.8: Prove that, for every integer n, there exists a way to 2-color the edges
of Kx so that there is no monochromatic clique of size k when

n 1− 2k
x=n− 2 .
k
(Hint: Start by 2-coloring the edges of Kn , then fix things up.)

Exercise 6.9: A tournament is a graph on n vertices with exactly one directed edge
between each pair of vertices. If vertices represent players, then each edge can be
thought of as the result of a match between the two players: the edge points to the win-
ner. A ranking is an ordering of the n players from best to worst (ties are not allowed).
Given the outcome of a tournament, one might wish to determine a ranking of the play-
ers. A ranking is said to disagree with a directed edge from y to x if y is ahead of x in
the ranking (since x beat y in the tournament).
(a) Prove that, for every tournament, there exists a ranking that disagrees with at most
50% of the edges.
(b) Prove that, for sufficiently large n, there exists a tournament such that every ranking
disagrees with at least 49% of the edges in the tournament.

Exercise 6.10: A family of subsets F of {1, 2, . . . , n} is called an antichain if there is

no pair of sets A and B in F satisfying A ⊂ B.
n
(a) Give an example of F where |F| = n/2 .
(b) Let fk be the number of sets in F with size k. Show that

n
fk
n ≤ 1.
k=0 k

164
6.11 exercises

(Hint: Choose a random permutation of the numbers from 1 to n, and let Xk = 1 if

the first k numbers in your permutation yield a set in F. If X = nk=0 Xk , what can
you say about X?)
(c) Argue that |F| ≤ n/2n
for any antichain F.

Exercise 6.11: Consider a graph in Gn,p with n vertices and each pair of vertices
independently connected by an edge with probability p. We prove a threshold for the
existence of triangles in the graph.
Let t1 , . . . , t(n ) be an enumeration of all triplets of three vertices in the graph. Let
3
Xi = 1 if the the three edges of the triplet ti appear in the graph, so that ti forms a triangle
(n3 )
in the graph. Otherwise Xi = 0. Let X = i=1 Xi .
(a) Compute E[X].
(b) Use (a) to show that if pn → 0 then Pr(X > 0) → 0.
(c) Show that Var[Xi ] ≤ p3 .
(d) Show that Cov(Xi , X j ) = p5 − p6 for O(n4 ) pairs i = j, otherwise Cov(Xi , X j ) =
0.
(e) Show that Var[X] = O(n3 p3 + n4 (p5 − p6 )).
(f) Conclude that if p is such that pn → ∞ then Pr(X = 0) → 0.

Exercise 6.12: In Section 6.5.1, we bounded the variance of the number of 4-cliques
in a random graph in order to demonstrate the second moment method. Showhow to
calculate the variance directly by using the equality from Exercise 3.9: for X = n1=1 Xi
the sum of Bernoulli random variables,
n
E[X 2 ] = Pr(Xi = 1)E[X | Xi = 1].
i=1

Exercise 6.13: Consider the problem of whether graphs in Gn,p have cliques of con-
stant size k. Suggest an appropriate threshold function for this property. Generalize the
argument used for cliques of size 4, using either the second moment method or the
conditional expectation i

Probability and Computing 2nd Edition

Uploaded by

Probability and Computing 2nd Edition

Uploaded by

Probability and Computing

Randomization and probabilistic techniques play an important role in modern computer

Michael Mitzenmacher is a Professor of Computer Science in the School of Engineering

Michael Mitzenmacher Eli Upfal

Cambridge University Press is part of the University of Cambridge.

Stephanie, Michaela, Jacqueline, and Chloe

Liane, Tamara, and Ilan

Preface to the Second Edition page xv

1 Events and Probability 1

2 Discrete Random Variables and Expectation 23

3 Moments and Deviations 47

3.3 Chebyshev’s Inequality 51

4 Chernoff and Hoeffding Bounds 66

5 Balls, Bins, and Random Graphs 97

6 The Probabilistic Method 135

6.2 The Expectation Argument 137

7 Markov Chains and Random Walks 168

8 Continuous Distributions and the Poisson Process 205

8.4.2 Combining and Splitting Poisson Processes 222

9 The Normal Distribution 242

10 Entropy, Randomness, and Information 269

11 The Monte Carlo Method 297

12 Coupling of Markov Chains 317

14 Sample Complexity, VC Dimension, and Rademacher

14.6.2 Estimating the Rademacher Complexity 387

15 Pairwise Independence and Universal Hash Functions 392

16 Power Laws and Related Distributions 415

17 Balanced Allocations and Cuckoo Hashing 433

17.5.2 Handling Failures 453

Further Reading 463

Note: Asterisks indicate advanced material for this chapter.

This textbook is designed to accompany one- or two-semester courses for advanced

technique. Techniques are clarified though examples based on analyzing randomized

1.1. Application: Verifying Polynomial Identities

the first i − 1 monomials requires (d 2 ) multiplications of coefficients. We assume in

1.2. Axioms of Probability

Lemma 1.1: For any two events E1 and E2 ,

Pr(E1 ∪ E2 ) = Pr(E1 ) + Pr(E2 ) − Pr(E1 ∩ E2 ).

Proof: From the definition,

Pr(E1 ) = Pr(E1 − (E1 ∩ E2 )) + Pr(E1 ∩ E2 ),

The lemma easily follows.

A consequence of Definition 1.2 is known as the union bound. Although it is very

Lemma 1.2: For any finite or countably infinite sequence of events E1 , E2 , . . . ,

Lemma 1.3: Let E1 , . . . , En be any n events. Then

The proof of the inclusion–exclusion principle is left as Exercise 1.7.

d simple events, and therefore

Notice that, when E and F are independent and Pr(F ) = 0, we have

considering for theoretical reasons. In practice, sampling with replacement is often

1.3. Application: Verifying Matrix Multiplication

Theorem 1.6 [Law of Total Probability]:

by the definition of conditional probability.

Pr(B | E ) Pr(E ) 1/2 2

1.4. Application: Naïve Bayesian Classifier

A naïve Bayesian classifier is a supervised learning algorithm that classifies objects by

{(D1 , c(D1 )), (D2 , c(D2 )), . . . , (Dn , c(Dn ))},

where each Di is represented as a features vector xi = (x1i , . . . , xm i

ˆ k∗ = xk | c(D∗ ) = c j ) = |{i : xk = xk , c(Di ) = c j }| .

Naïve Bayes Classifier Algorithm

Input: Set of possible classifications C, set of features and feature values

ˆ k∗ = xk | c(D∗ ) = c) = |{i : xk = xk , c(Di ) = c}| .

2. To compute a classification distribution:

Algorithm 1.1: Naïve Bayes Classifier.

a collection of items characterized by two Boolean features X and Y. If X = Y the item

1.5. Application: A Randomized Min-Cut Algorithm

(c) The sum of the dice is even.

Exercise 1.3: We shuffle a standard deck of cards, obtaining a permutation that

Exercise 1.7: (a) Prove Lemma 3, the inclusion–exclusion principle.

(c) Prove that, when  is even,

Exercise 1.14: I am playing in a racquetball tournament, and I am up against a player

Exercise 1.18: We have a function F : {0, . . . , n − 1} → {0, . . . , m − 1}. We

Exercise 1.20: Show that, if E1 , E2 , . . . , En are mutually independent, then so are

Exercise 1.24: Generalizing on the notion of a cut-set, we define an r-way cut-set in a

2.1. Random Variables and Expectation

2.1.1. Linearity of Expectations

Applying the linearity of expectations, we have

E[X] = E[X1 ] + E[X2 ] = 7.

Notice that, when E and F are independent and Pr(F ) = 0, we have

(c) Prove that, when is even,

E[ f (X )] ≥ E[ f (μ) + f (μ)(X − μ)]