Notes On Randomized Algorithms
James Aspnes
2024-07-27 23:04
Contents

Preface

1  Randomized algorithms
   1.1  Searching an array
   1.2  Verifying polynomial identities
   1.3  Randomized QuickSort
        1.3.1  Brute force method: solve the recurrence
        1.3.2  Clever method: use linearity of expectation
   1.4  Where does the randomness come from?
   1.5  Classifying randomized algorithms
        1.5.1  Las Vegas vs Monte Carlo
        1.5.2  Randomized complexity classes
   1.6  Classifying randomized algorithms by their methods

2  Probability theory
   2.1  Probability spaces and events
        2.1.1  General probability spaces
   2.2  Boolean combinations of events
   2.3  Conditional probability
        2.3.1  Conditional probability and independence
        2.3.2  Conditional probability and the law of total probability
        2.3.3  Examples

3  Random variables
   3.1  Operations on random variables
   3.2  Random variables and events
   3.3  Measurability
   3.4  Expectation
        3.4.1  Linearity of expectation
               3.4.1.1  Linearity of expectation for infinite sequences
        3.4.2  Expectation and inequalities
        3.4.3  Expectation of a product
               3.4.3.1  Wald's equation (simple version)
   3.5  Conditional expectation
        3.5.1  Expectation conditioned on an event
        3.5.2  Expectation conditioned on a random variable
               3.5.2.1  Calculating conditional expectations
               3.5.2.2  The law of iterated expectation
               3.5.2.3  Conditional expectation as orthogonal projection
        3.5.3  Expectation conditioned on a σ-algebra
        3.5.4  Examples
   3.6  Applications
        3.6.1  Yao's lemma
        3.6.2  Geometric random variables
        3.6.3  Coupon collector
        3.6.4  Hoare's FIND

5  Concentration bounds
   5.1  Chebyshev's inequality
        5.1.1  Computing variance
               5.1.1.1  Alternative formula
               5.1.1.2  Variance of a Bernoulli random variable
               5.1.1.3  Variance of a sum
               5.1.1.4  Variance of a geometric random variable
        5.1.2  More examples
               5.1.2.1  Flipping coins
               5.1.2.2  Balls in bins
               5.1.2.3  Lazy select
   5.2  Chernoff bounds
        5.2.1  The classic Chernoff bound
        5.2.2  Easier variants
        5.2.3  Lower bound version
        5.2.4  Two-sided version
        5.2.5  What if we only have a bound on E[S]?
        5.2.6  Almost-independent variables
        5.2.7  Other tail bounds for the binomial distribution
        5.2.8  Applications
               5.2.8.1  Flipping coins
               5.2.8.2  Balls in bins again
               5.2.8.3  Flipping coins, central behavior
               5.2.8.4  Permutation routing on a hypercube
   5.3  The Azuma-Hoeffding inequality
        5.3.1  Hoeffding's inequality
               5.3.1.1  Hoeffding vs Chernoff
               5.3.1.2  Asymmetric version
        5.3.2  Azuma's inequality
        5.3.3  The method of bounded differences
        5.3.4  Applications
               5.3.4.1  Sprinkling points on a hypercube
               5.3.4.2  Chromatic number of a random graph
               5.3.4.3  Balls in bins
               5.3.4.4  Probabilistic recurrence relations
               5.3.4.5  Multi-armed bandits
                        The UCB1 algorithm
                        Analysis of UCB1
   5.4  Relation to limit theorems
   5.5  Anti-concentration bounds
        5.5.1  The Berry-Esseen theorem
        5.5.2  The Littlewood-Offord problem

7  Hashing
   7.1  Hash tables
   7.2  Universal hash families
        7.2.1  Linear congruential hashing
        7.2.2  Tabulation hashing
   7.3  FKS hashing
   7.4  Cuckoo hashing
        7.4.1  Structure
        7.4.2  Analysis
   7.5  Practical issues
   7.6  Bloom filters
        7.6.1  Construction
        7.6.2  False positives
        7.6.3  Comparison to optimal space
        7.6.4  Applications
        7.6.5  Counting Bloom filters
   7.7  Data stream computation
        7.7.1  Cardinality estimation
        7.7.2  Count-min sketches
               7.7.2.1  Initialization and updates
               7.7.2.2  Queries
               7.7.2.3  Finding heavy hitters

14  Derandomization
    14.1  Deterministic vs. randomized algorithms
    14.2  Adleman's theorem
    14.3  Limited independence
          14.3.1  MAX CUT
    14.4  The method of conditional probabilities
          14.4.1  MAX CUT using conditional probabilities
          14.4.2  Deterministic construction of Ramsey graphs
          14.4.3  Derandomized set balancing

Bibliography

Index
Preface
These are notes for the Yale course CPSC 469/569 Randomized Algorithms.
This document also incorporates the lecture schedule and assignments, as
well as some sample assignments from previous semesters. Because this
is a work in progress, it will be updated frequently over the course of the
semester.
Much of the structure of the course follows Mitzenmacher and Upfal's
Probability and Computing [MU17], with some material from Motwani and
Raghavan’s Randomized Algorithms [MR95]. In most cases you’ll find these
textbooks contain much more detail than what is presented here, so it is
probably better to consider this document a supplement to them than to
treat it as your primary source of information.
The most recent version of these notes will be available at
https://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. More stable
archival versions may be found at https://arxiv.org/abs/2003.01902.
I would like to thank my many students and teaching fellows over the
years for their help in pointing out errors and omissions in earlier drafts of
these notes.
Chapter 1
Randomized algorithms
In this case, the bad input distribution is simple: put the 1 in each
position A[i] with equal probability 1/n. For a deterministic algorithm, there
will be some fixed sequence of positions i1 , i2 , . . . that it examines as long as
it only sees zeros. A smart deterministic algorithm will not examine the same
position twice, so the 1 is equally likely to be found after 1, 2, 3, . . . , n probes. This
gives the same expected (n + 1)/2 probes as for the simple randomized algorithm,
which shows that that algorithm is optimal.
We’ve been talking about searching an array, because that fits best in
our model of an input supplied to the algorithm, but essentially the same
analysis applies to brute-force inverting a black-box function. Here we have
a function f and target output y, and we want to find an input x such that
f(x) = y. The same analysis as for the array case shows that this takes (n + 1)/2
expected evaluations of f assuming that exactly one x works and we can't
do anything clever.
Curiously, in this case it may be possible to improve this bound to O(√n)
evaluations if somehow we get our hands on a working quantum computer.
We’ll come back to this when we discuss quantum computing in general and
Grover’s algorithm in particular in Chapter 16.
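For a concrete check of the (n + 1)/2 figure, here is a small simulation sketch (not part of the notes; the names are illustrative). It probes a length-n array containing a single 1 in a uniformly random order and averages the number of probes:

```python
import random

def probes_to_find(n):
    """Probe a length-n array holding a single 1 in uniformly random order;
    return how many probes were needed to find it."""
    target = random.randrange(n)        # location of the 1
    order = random.sample(range(n), n)  # random permutation of probe positions
    for probes, i in enumerate(order, start=1):
        if i == target:
            return probes

n, trials = 100, 100_000
avg = sum(probes_to_find(n) for _ in range(trials)) / trials
print(avg, (n + 1) / 2)  # both close to 50.5
```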
get a root. Indeed, evaluating p(11) = 112320 and q(11) = 120306 quickly
shows that p and q are not in fact the same.
This is an example of a Monte Carlo algorithm, which is an algorithm
that runs in a fixed amount of time but only gives the right answer some of
the time. (In this case, with probability 1 − d/r, where r is the size of the
range of random integers we choose x from.) Monte Carlo algorithms have
the unnerving property of not indicating when their results are incorrect,
but we can make the probability of error as small as we like by running
the algorithm repeatedly. For this particular algorithm, the probability of
error after k trials is only (d/r)^k, which means that for fixed d/r we need
O(log(1/ε)) iterations to get the error bound down to any given ε. If we are
really paranoid, we could get the error down to 0 by testing d + 1 distinct
values, but now the cost is as high as multiplying out p again.
The error for this algorithm is one-sided: if we find a witness to the fact
that p ≠ q, we are done, but if we don't, then all we know is that we haven't
found a witness yet. We also have the property that if we check enough
possible witnesses, we are guaranteed to find one.
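To make this concrete, here is a minimal sketch of the one-sided test (not from the notes; the representation and names are illustrative), storing each polynomial as a list of coefficients and drawing evaluation points from a range r much larger than the degree d:

```python
import random

def poly_eval(coeffs, x):
    """Evaluate a polynomial given as a coefficient list (constant term first)."""
    result = 0
    for c in reversed(coeffs):  # Horner's rule
        result = result * x + c
    return result

def probably_equal(p, q, trials=10, r=10**9):
    """Monte Carlo equality test: a single differing evaluation is a witness
    that p != q; agreement on every trial means p = q except with
    probability at most (d/r)**trials."""
    for _ in range(trials):
        x = random.randrange(r)
        if poly_eval(p, x) != poly_eval(q, x):
            return False  # definitely not equal
    return True           # probably equal

print(probably_equal([1, 2, 1], [1, 2, 1]))  # (x+1)^2 vs itself: True
print(probably_equal([1, 2, 1], [1, 2, 2]))  # differs in x^2 term: almost surely False
```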
A similar property holds in the classic Miller-Rabin primality test,
a randomized algorithm for determining whether a large integer is prime
or not.[4] The original version, due to Gary Miller [Mil76], showed that, as
in polynomial identity testing, it might be sufficient to pick a particular
set of deterministic candidate witnesses. Unfortunately, this result depends
on the truth of the extended Riemann hypothesis, a notoriously difficult
open problem in number theory. Michael Rabin [Rab80] demonstrated that
choosing random witnesses was enough, if we were willing to accept a small
probability of incorrectly identifying a composite number as prime.
For many years it was open whether it was possible to test primality
deterministically in polynomial time without unproven number-theoretic
assumptions, and the randomized Miller-Rabin algorithm was one of the
most widely-used randomized algorithms for which no good deterministic
alternative was known. Eventually, Agrawal et al. [AKS04] demonstrated
how to test primality deterministically using a different technique, although
the cost of their algorithm is high enough that Miller-Rabin is still used in
practice.
[4] We will not describe this algorithm here.
    T(n) = (n − 1) + (1/n) Σ_{k=0}^{n−1} (T(k) + T(n − 1 − k)).   (1.3.1)

    T(n) = (n − 1) + (1/n) Σ_{k=0}^{n−1} (T(k) + T(n − 1 − k))
         = (n − 1) + (2/n) Σ_{k=0}^{n−1} T(k)
         = (n − 1) + (2/n) Σ_{k=1}^{n−1} T(k)
         ≤ (n − 1) + (2/n) Σ_{k=1}^{n−1} a k log k
         ≤ (n − 1) + (2/n) ∫_1^n a k log k dk
         = (n − 1) + (2a/n) (n² log n/2 − n²/4 + 1/4)
         = (n − 1) + a n log n − an/2 + a/(2n).
If we squint carefully at this recurrence for a while we notice that setting
a = 2 makes this less than or equal to an log n, since the remaining terms
become (n − 1) − n + 1/n = 1/n − 1, which is nonpositive for n ≥ 1. We can
thus confidently conclude that T (n) ≤ 2n log n (for n ≥ 1).
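As an empirical sanity check on this bound, the sketch below (not part of the notes; log here is the natural logarithm, matching the integral above) counts the comparisons made by randomized QuickSort and compares the average to 2n log n:

```python
import math
import random

def quicksort_comparisons(n):
    """Count comparisons made by randomized QuickSort on n distinct keys."""
    comparisons = 0
    def sort(items):
        nonlocal comparisons
        if len(items) <= 1:
            return items
        pivot = random.choice(items)
        comparisons += len(items) - 1  # everything else is compared to the pivot
        less = [x for x in items if x < pivot]
        more = [x for x in items if x > pivot]
        return sort(less) + [pivot] + sort(more)
    sort(list(range(n)))
    return comparisons

n, trials = 1000, 100
avg = sum(quicksort_comparisons(n) for _ in range(trials)) / trials
print(avg, 2 * n * math.log(n))  # the average stays below 2n ln n
```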
recurrence in §1.3.1. Given that we now know the exact answer, we could in
principle go back and use it to solve the recurrence exactly.
Which way is better? Solving the recurrence requires less probabilistic
handwaving (a more polite term might be “insight”) but more grinding
out inequalities, which is a pretty common trade-off. Since I am personally
not very clever I would try the brute-force approach first. But it’s worth
knowing about better methods so you can try them in other situations.
produce the wrong answer. These are examples of two classes of randomized
algorithms, which were originally named by László Babai [Bab79]:
• A Las Vegas algorithm fails with some probability, but we can tell
when it fails. In particular, we can run it again until it succeeds, which
means that we can eventually succeed with probability 1 (but with a
potentially unbounded running time). Alternatively, we can think of a
Las Vegas algorithm as an algorithm that runs for an unpredictable
amount of time but always succeeds (we can convert such an algorithm
back into one that runs in bounded time by declaring that it fails if it
runs too long—a condition we can detect). QuickSort is an example of
a Las Vegas algorithm.
• A Monte Carlo algorithm fails with some probability, but we can’t
tell when it fails. If the algorithm produces a yes/no answer and the
failure probability is significantly less than 1/2, we can reduce the
probability of failure by running it many times and taking a majority of
the answers. The polynomial equality-testing algorithm is an example
of a Monte Carlo algorithm.
The heuristic for remembering which class is which is that the names
were chosen to appeal to English speakers: in Las Vegas, the dealer can tell
you whether you’ve won or lost, but in Monte Carlo, le croupier ne parle que
Français, so you have no idea what he’s saying.
Generally, we prefer Las Vegas algorithms, because we like knowing
when we have succeeded. But sometimes we have to settle for Monte Carlo
algorithms, which can still be useful if we can get the probability of failure
small enough. For example, any time we try to estimate an average by
sampling (say, inputs to a function we are trying to integrate or political
views of voters we are trying to win over) we are running a Monte Carlo
algorithm: there is always some possibility that our sample is badly non-
representative, but we can’t tell if we got a bad sample unless we already
know the answer we are looking for.
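In code, the majority-vote amplification for a yes/no Monte Carlo algorithm looks like the following sketch (not from the notes; `noisy_answer` is a hypothetical stand-in for any algorithm whose error probability is bounded below 1/2):

```python
import random

def noisy_answer(truth, error_prob=0.3):
    """Hypothetical Monte Carlo algorithm: right answer except with
    probability error_prob < 1/2."""
    return truth if random.random() > error_prob else not truth

def amplified(truth, runs=101):
    """Run the noisy algorithm many times and take the majority answer;
    the failure probability drops exponentially in the number of runs."""
    yes_votes = sum(noisy_answer(truth) for _ in range(runs))
    return yes_votes * 2 > runs

trials = 10_000
errors = sum(amplified(True) is not True for _ in range(trials))
print(errors / trials)  # tiny compared to the per-run error 0.3
```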
these algorithms, we never get a bogus “yes” answer but may get a bogus
“no” answer (or vice versa). This gives us several complexity classes that act
like randomized versions of NP, co-NP, etc.:
• The class co-R consists of all languages L for which a poly-time Turing
machine M exists such that if x ∉ L, then Pr[M(x, r) = 1] ≥ 1/2 and
if x ∈ L, then Pr [M (x, r) = 1] = 0. This is the randomized analog of
co-NP.
Chapter 2

Probability theory
2. Pr [Ω] = 1.
It’s not hard to see that the discrete probability spaces defined in the
preceding section satisfy these axioms.
General probability spaces arise in randomized algorithms when we have
an algorithm that might consume an unbounded number of random bits.
The problem now is that an outcome consists of countable sequence of bits,
and there are uncountably many such outcomes. The solution is to consider
as measurable events only those sets with the property that membership
in them can be determined after a finite amount of time. Formally, the
probability space Ω is the set {0, 1}^ℕ of all countably infinite sequences of 0
and 1 values indexed by the natural numbers, and the measurable sets F
are all sets that can be generated by countable unions[1] of cylinder sets,
where a cylinder set consists of all extensions xy of some finite prefix x. The
probability measure itself is obtained by assigning the set of all points that
start with x the probability 2^{−|x|}, and computing the probabilities of other
sets from the axioms.[2]
An oddity that arises in general probability spaces is it may be that every
particular outcome has probability zero but their union has probability 1.
For example, the probability of any particular infinite string of bits is 0, but
the set containing all such strings is the entire space and has probability
1. This is where the fact that probabilities only add over countable unions
comes in.
Most randomized algorithms books gloss over general probability spaces,
with three good reasons. The first is that if we truncate an algorithm after
a finite number of steps, we usually get back to a discrete probability
space, which avoids a lot of worrying about measurability and convergence.
The second is that we are often implicitly working in a probability space that
is either discrete or well-understood (like the space of bit-vectors described
above). The last is that the Kolmogorov extension theorem says that
[1] As well as complements and countable intersections. However, it is not hard to show
that sets defined using these operations can be reduced to countable unions of cylinder
sets.
[2] This turns out to give the same probabilities as if we consider each outcome as a
real number in the interval [0, 1] and use Lebesgue measure to compute the probability of
events. For some applications, thinking of our random values as real numbers (or even
sequences of real numbers) can make things easier: consider for example what happens
when we want to choose one of three outcomes with equal probability.
Lemma 2.2.1.

    Pr[Ā] = 1 − Pr[A].

For example, if our probability space consists of the six outcomes of a fair
die roll, and A = [outcome is 3] with Pr[A] = 1/6, then Pr[outcome is not 3] =
Pr[Ā] = 1 − 1/6 = 5/6. Though this example is trivial, using the formula
does save us from having to add up the five cases where we don't get 3.
If we want to know the probability of A ∩ B, we need to know more
about the relationship between A and B. For example, it could be that
A and B are both events representing a fair coin coming up heads, with
Pr [A] = Pr [B] = 1/2. The probability of A ∩ B could be anywhere between
1/2 and 0:
• For ordinary fair coins, we’d expect that half the time that A happens,
B also happens. This gives Pr [A ∩ B] = (1/2) · (1/2) = 1/4. To
make this formal, we might define our probability space Ω as having
four outcomes HH, HT, TH, and TT, each of which occurs with equal
probability.
• But maybe A and B represent the same fair coin: then A ∩ B = A and
Pr [A ∩ B] = Pr [A] = 1/2.
[3] Countable need not be infinite, so 2 is countable.
• At the other extreme, maybe A and B represent two fair coins welded
together so that if one comes up heads the other comes up tails. Now
Pr [A ∩ B] = 0.
The difference between the nice case where Pr [A ∩ B] equals 1/4 and
the other, more annoying cases where it doesn’t is that in the first case we
have assumed that A and B are independent, which is defined to mean
that Pr [A ∩ B] = Pr [A] Pr [B].
In the real world, we expect events to be independent if they refer to
parts of the universe that are not causally related: if we flip two coins that
aren’t glued together somehow, then we assume that the outcomes of the
coins are independent. But we can also get independence from events that
are not causally disconnected in this way. An example would be if we rolled
a fair four-sided die labeled HH, HT, TH, TT, where we take the first letter
as representing A and the second as B.
There’s no simple formula for Pr [A ∪ B] when A and B are not disjoint,
even for independent events, but we can compute the probability by splitting
up into smaller, disjoint events and using countable additivity:
    Pr[A ∪ B] = Pr[(A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B)]
              = Pr[A ∩ B] + Pr[A ∩ B̄] + Pr[Ā ∩ B]
              = Pr[A ∩ B] + Pr[A ∩ B̄] + Pr[Ā ∩ B] + Pr[A ∩ B] − Pr[A ∩ B]
              = Pr[A] + Pr[B] − Pr[A ∩ B].
B_T with T ≠ ∅.
That the right-hand side gives the probability of this event is a sneaky con-
sequence of the binomial theorem, and in particular the fact that Σ_{i=1}^{n} (n choose i)(−1)^i =
Σ_{i=0}^{n} (n choose i)(−1)^i − 1 = (1 − 1)^n − 1 is −1 if n > 0 and 0 if n = 0. Using this fact
after rewriting the right-hand side using the B_T events gives

    Σ_{S⊆{1...n},S≠∅} (−1)^{|S|+1} Pr[∩_{i∈S} A_i] = Σ_{S⊆{1...n},S≠∅} (−1)^{|S|+1} Σ_{T⊇S} Pr[B_T]
        = Σ_{T⊆{1...n}} Σ_{S⊆T,S≠∅} (−1)^{|S|+1} Pr[B_T]
        = Σ_{T⊆{1...n}} −Pr[B_T] (Σ_{i=1}^{n} (|T| choose i)(−1)^i)
        = Σ_{T⊆{1...n}} −Pr[B_T] ((1 − 1)^{|T|} − 1)
        = Σ_{T⊆{1...n}} Pr[B_T] (1 − 0^{|T|})
        = Σ_{T⊆{1...n},T≠∅} Pr[B_T]
        = Pr[∪_{i=1}^{n} A_i].
    Pr[A | B] = Pr[A ∩ B]/Pr[B],   (2.3.1)
likely, the first random thing the algorithm does is flip a coin, giving two
possible outcomes B and B̄. Countable additivity tells us that Pr[A] =
Pr[A ∩ B] + Pr[A ∩ B̄], which we can rewrite using conditional probability
as

    Pr[A] = Pr[A | B] Pr[B] + Pr[A | B̄] Pr[B̄],   (2.3.2)
More generally, if the events B_i partition Ω, then

    Pr[A] = Σ_i Pr[A ∩ B_i] = Σ_i Pr[A | B_i] Pr[B_i],   (2.3.3)

which is the law of total probability. Note that the last step works for
each term only if Pr[A | B_i] is well-defined, meaning that Pr[B_i] ≠ 0. But
any case where Pr[B_i] = 0 also has Pr[A ∩ B_i] = 0, so we get the correct
answer if we simply omit these terms from both sums.
A special case arises when Pr[A | B̄] = 0, which occurs, for example,
if A ⊆ B. Then we just have Pr[A] = Pr[A | B] Pr[B]. If we consider an
2.3.3 Examples
Here we have some examples of applying conditional probability to algorithm
analysis. Mostly we will be using some form of the law of total probability.
We can compute

    Pr[W | A_i] = Pr[B_0 ∪ B_1 ∪ · · · ∪ B_i | A_i]
               = Pr[B_0 ∪ B_1 ∪ · · · ∪ B_i]
               = Pr[B_0] + Pr[B_1] + · · · + Pr[B_i]
               = Σ_{j=0}^{i} (1/2)^{j+1}
               = (1/2) · (1 − (1/2)^{i+1})/(1 − 1/2)
               = 1 − (1/2)^{i+1}.   (2.3.5)
The clean form of this expression suggests strongly that there is a better
way to get it, and that this way involves taking the negation of the intersection
of i + 1 independent events that occur with probability 1/2 each. With a
little reflection, we can see that the probability that your objects don’t fit in
my buffer is exactly (1/2)^{i+1}.
From the law of total probability (2.3.3),

    Pr[W] = Σ_{i=0}^{∞} (1 − (1/2)^{i+1}) (1/2)^{i+1}
          = 1 − Σ_{i=0}^{∞} (1/4)^{i+1}
          = 1 − (1/4) · 1/(1 − 1/4)
          = 2/3.
Pr[W] = Σ_{i=0}^{∞} (2/3) Pr[C_i] = 2/3 since Pr[C_i] sums to 1, because the union
of these events includes the entire space except for the probability-zero event
C_∞.
Still another approach is to compute the probability that our runs have
exactly the same length (Σ_{i=1}^{∞} 2^{−i} · 2^{−i} = 1/3), and argue by symmetry that
the remaining 2/3 probability is equally split between my run being longer
(1/3) and your run being longer (1/3). Since W occurs if my run is just as
long or longer, Pr[W ] = 1/3 + 1/3 = 2/3. A nice property of this approach is
that the only summation involved is over disjoint events, so we get to avoid
using conditional probability entirely.
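Since the setup of this example is only partly visible in this excerpt, take the following simulation as a hedged sketch of one consistent reading: each player's run length L is geometric with Pr[L = i] = 2^{−i} for i ≥ 1, and W is the event that my run is at least as long as yours.

```python
import random

def run_length():
    """Flip a fair coin until it comes up tails, counting the flips;
    this gives Pr[L = i] = 2**-i for i >= 1."""
    length = 1
    while random.random() < 0.5:
        length += 1
    return length

trials = 200_000
wins = sum(run_length() >= run_length() for _ in range(trials))
print(wins / trials)  # close to 2/3
```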
Proof. Let (S, T ) be a min cut of size k. Then the degree of each vertex v
is at least k (otherwise (v, G − v) would be a smaller cut), and G contains
at least kn/2 edges. The probability that we contract an S–T edge is thus
at most k/(kn/2) = 2/n, and the probability that we don't contract one is
at least 1 − 2/n.

[5] Unlike ordinary graphs, multigraphs can have more than one edge between two vertices.
[Figure 2.1: Karger's min-cut algorithm. The initial graph has min cut ⟨{a, b, c}, {d, e, f}⟩. We find this cut by getting lucky and contracting edges ab, df, de, and ac in that order; the final two-vertex graph gives the cut.]
This tells us what happens when we are considering a particular min cut.
If the graph has more than one min cut, this only makes our life easier. Note
that since each min cut turns up with probability at least 1/(n choose 2), there can't
be more than (n choose 2) of them.[6] But even if there is only one, we have a good

[6] The suspiciously combinatorial appearance of the 1/(n choose 2) suggests that there should
be some way of associating minimum cuts with particular pairs of vertices, but I'm not
aware of any natural way to do this. Sometimes the appearance of a simple expression
in a surprising context may just stem from the fact that there aren't very many distinct
simple expressions.
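Here is a compact sketch of the contraction algorithm itself (an illustration, not the notes' presentation), representing the multigraph as a list of its original edges; an edge whose endpoints have already been merged acts as a self-loop and is skipped, which keeps the choice uniform over the surviving multigraph edges:

```python
import random

def karger_cut(n, edges):
    """One run of Karger's contraction algorithm on vertices 0..n-1.
    Contract random edges until two super-vertices remain, then return
    the number of original edges crossing between them."""
    group = list(range(n))  # group[v] = super-vertex currently containing v
    groups = n
    while groups > 2:
        u, v = random.choice(edges)
        if group[u] == group[v]:
            continue        # self-loop after earlier contractions; skip it
        old, new = group[u], group[v]
        group = [new if g == old else g for g in group]
        groups -= 1
    return sum(group[u] != group[v] for u, v in edges)

# Two triangles joined by a single edge: the min cut is 1.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(min(karger_cut(6, edges) for _ in range(100)))  # almost surely 1
```

Since one run finds a particular min cut with probability at least 1/(n choose 2), repeating the run and keeping the smallest cut found makes failure unlikely.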
Chapter 3

Random variables
S Probability
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
Table 3.1: Probability mass function for the sum of two independent fair
six-sided dice
probability mass function shown in Table 3.1. For a discrete random variable
X, the probability mass function gives enough information to calculate the
probability of any event involving X, since we can just sum up cases using
countable additivity. This gives us another way to compute Pr[S < 5] =
Pr[S = 2] + Pr[S = 3] + Pr[S = 4] = (1 + 2 + 3)/36 = 1/6.
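Table 3.1 is easy to recompute by enumerating the 36 equally likely outcomes (a sketch, not from the notes):

```python
from collections import Counter
from fractions import Fraction

# Joint distribution of two independent fair dice: 36 equally likely outcomes.
pmf = Counter(a + b for a in range(1, 7) for b in range(1, 7))
for s in sorted(pmf):
    print(s, Fraction(pmf[s], 36))

# Pr[S < 5] by summing cases, as in the text:
print(Fraction(sum(pmf[s] for s in (2, 3, 4)), 36))  # 1/6
```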
For two random variables, the joint probability mass function gives
Pr [X = x ∧ Y = y] for each pair of values x and y. This generalizes in the
obvious way for more than two variables.
We will often refer to the probability mass function as giving the distri-
bution or joint distribution of a random variable or collection of random
variables, even though distribution (for real-valued variables) technically
refers to the cumulative distribution function F (x) = Pr [X ≤ x], which
is generally not directly computable from the probability mass function for
continuous random variables that take on uncountably many values. To
the extent that we can, we will try to avoid continuous random variables,
and the rather messy integration theory needed to handle them.
Two or more random variables are independent if all sets of events
involving different random variables are independent. In terms of proba-
3.3 Measurability
For discrete probability spaces, any function on outcomes can be a random
variable. The reason is that any event in a discrete probability space has
a well-defined probability. For more general spaces, in order to be useful,
events involving a random variable should have well-defined probabilities.
For discrete random variables that take on only countably many values
(e.g., integers or rationals), it’s enough for the event [X = x] (that is, the
set {ω | X(ω) = x}) to be in F for all x. For real-valued random variables,
we ask that the event [X ≤ x] be in F. In these cases, we say that X is
measurable with respect to F, or just measurable F. More exotic random
variables use a definition of measurability that generalizes the real-valued
version, which we probably won’t need.3 Since we usually just assume that
all of our random variables are measurable unless we are doing something
funny with F to represent ignorance, this issue won’t come up much.
3.4 Expectation
The expectation or expected value of a random variable X is given by
E[X] = Σ_x x Pr[X = x]. This is essentially an average value of X weighted
by probability, and it only makes sense if X takes on values that can be
summed in this way (e.g., real or complex values, or vectors in a real- or
[3] The general version is that if X takes on values in another measure space (Ω′, F′),
then the inverse image X⁻¹(A) = {ω ∈ Ω | X(ω) ∈ A} of any set A in F′ is in F. This
means in particular that Pr_Ω maps through X to give a probability measure on Ω′ by
Pr_{Ω′}[A] = Pr_Ω[X⁻¹(A)], and the condition on X⁻¹(A) being in F makes this work.
= a E [X] + b E [Y ] .
The claim continues to hold even in the general case, but the proof is
more work.
One special case of this that comes up often is that X ≥ 0 implies
E [X] ≥ 0.
[6] The trick here is that we are trading a probability-1 gain of 1 against a probability-0
loss of ∞. So we could declare that E[Σ_{i=0}^{∞} X_i] involves 0 · (−∞) and is undefined.
But this would lose the useful property that expectation isn’t affected by probability-0
outcomes. As often happens in mathematics, we are forced to choose between candidate
definitions based on which bad consequences we most want to avoid, with no way to avoid
all of them. So the standard definition of expectation allows the St. Petersburg paradox
because the alternatives are worse.
= E [X] E [Y ] .
Here we can’t use the sum formula directly, because N is a random variable,
and we can’t use the product formula, because the Xi are all different random
variables.
If N and the Xi are all independent (which may or may not be the case
for the loop example), and N is bounded by some fixed maximum n, then
we can apply the product rule to get the value of (3.4.2) by throwing in
a few indicator variables. The idea is that the contribution of Xi to the
sum is given by X_i 1_{[N≥i]}, and because we assume that N is independent
of the X_i, if we need to compute E[X_i 1_{[N≥i]}], we can do so by computing
E[X_i] E[1_{[N≥i]}].
So we get

    E[Σ_{i=1}^{N} X_i] = E[Σ_{i=1}^{n} X_i 1_{[N≥i]}]
                      = Σ_{i=1}^{n} E[X_i 1_{[N≥i]}]
                      = Σ_{i=1}^{n} E[X_i] E[1_{[N≥i]}].
For general Xi we have to stop here. But if we also know that the Xi all
have the same expectation µ, then E [Xi ] doesn’t depend on i and we can
bring it out of the sum. This gives
    Σ_{i=1}^{n} E[X_i] E[1_{[N≥i]}] = µ Σ_{i=1}^{n} E[1_{[N≥i]}]
                                   = µ E[N].   (3.4.3)
This equation is a special case of Wald’s equation, which we will see
again in §9.4.2. The main difference between this version and the general
version is that here we had to assume that N was independent of the Xi ,
which may not be true if our loop is a while loop, and termination after a
particular iteration is correlated with the time taken by that iteration.
But for simple cases, (3.4.3) can still be useful. For example, if we throw
one six-sided die to get N , and then throw N six-sided dice and add them
up, we get the same expected total (7/2) · (7/2) = 49/4 as if we just multiply two
six-sided dice. This is true even though the actual distribution of the values
is very different in the two cases.
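The dice version of Wald's equation is easy to check by simulation (a sketch, not from the notes):

```python
import random

def roll():
    return random.randint(1, 6)

trials = 200_000
# Throw one die to get N, then throw N dice and add them up.
average = sum(sum(roll() for _ in range(roll())) for _ in range(trials)) / trials
print(average, 49 / 4)  # both close to 12.25 = (7/2) * (7/2)
```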
E [aX + bY | A] = a E [X | A] + b E [Y | A] (3.5.2)
This is just another way of saying what we said already: if you want to know
what the expectation of X is conditioned on Y when you get outcome ω,
find the value of Y at ω and condition on seeing that.
Here is a simple example. Suppose that X and Y are independent fair
coin-flips that take on the values 0 and 1 with equal probability. Then our
probability space Ω has four elements, and looks like this:
⟨0, 0⟩  ⟨0, 1⟩
⟨1, 0⟩  ⟨1, 1⟩

But now suppose we only know X, and want to compute E[Z | X]. When
X = 0, E[Z | X = 0] = (1/2) · 0 + (1/2) · 1 = 1/2; and when X = 1, E[Z | X = 1] =
(1/2) · 1 + (1/2) · 2 = 3/2. So drawing E[Z | X] over our probability space gives

1/2  1/2
3/2  3/2

We've averaged the value of Z across each row, since each row corresponds
to one of the possible values of X.
If instead we compute E[Z | Y ], we get this picture instead:

1/2  3/2
1/2  3/2
= A(z) E [X | Z = z] + B(z) E [Y | Z = z] ,
which is the value when Z = z of A E [X | Z] + B E [Y | Z].
This means that we can quickly simplify many conditional expectations.
If we go back to the example of the previous section, where Z = X + Y is
the sum of two independent fair coin-flips X and Y , then we can compute
    E[Z | X] = E[X + Y | X]
             = E[X | X] + E[Y | X]
             = X + E[Y]
             = X + 1/2.

Similarly, E[Z | Y] = E[X + Y | Y] = 1/2 + Y.
In some cases we have enough additional information to run this in
reverse. If we know Z and want to estimate X, we can use the fact that
X and Y are symmetric to argue that E [X | Z] = E [Y | Z]. But then
Z = E [X + Y | Z] = 2 E [X | Z], so E [X | Z] = Z/2. Note that this works
in general only if the events [X = a, Y = b] and [X = b, Y = a] have the
same probabilities for all a and b even if we condition on Z, which in this
case follows from the fact that X and Y are independent and identically
distributed and that addition is commutative. Other cases may be messier.8
Other facts, like X ≥ Y implies E [X | Z] ≥ E [Y | Z], can be proved
using similar techniques.
8
An example with X and Y identically distributed but not independent is to imagine
that we roll a six-sided die to get X, and let Y = X + 1 if X < 6 and Y = 1 if X = 6. Now
knowing Z = X + Y = 3 tells me that X = 1 and Y = 2 exactly, neither of which is Z/2.
E [X] = E [E [X | Y ]] . (3.5.6)
= E [X] .
The trick here is that we use (3.5.3) to expand out the original expression in
terms of the events Ay , then notice that E [X | Y ] is equal to E [X | Y = y]
whenever Y = y.
So as claimed, conditioning on a variable gives a way to write averaging
over cases very compactly.
It’s also not too hard to show that iterated expectation works with partial
conditioning:
E [E [X | Y, Z] | Y ] = E [X | Y ] . (3.5.7)
and the fact that E [X] is an orthogonal projection of X means that E [X]
is precisely the constant value µ that minimizes the distance E[(X − µ)²].
We won’t actually use any of these facts in the following, but having another
way to look at conditional expectation may be helpful in understanding how
it works.
3.5.4 Examples
• Let X be the value of a six-sided die. Let A be the event that X is
even. Then
    E[X | A] = Σ_x x Pr[X = x | A]
             = (2 + 4 + 6) · (1/3)
             = 4.
3.6 Applications
3.6.1 Yao’s lemma
In Section 1.1, we considered a special case of the unordered search problem,
where we have an unordered array A[1..n] and want to find the location
of a specific element x. For deterministic algorithms, this requires probing
n array locations in the worst case, because the adversary can place x in
the last place we look. Using a randomized algorithm, we can reduce this
to (n + 1)/2 probes on average, either by probing according to a uniform
random permutation or just by probing from left-to-right or right-to-left
with equal probability.
Can we do better? Proving lower bounds is a nuisance even for determin-
istic algorithms, and for randomized algorithms we have even more to keep
track of. But there is a sneaky trick that allows us to reduce randomized
lower bounds to deterministic lower bounds in many cases.
The idea is that if we have a randomized algorithm that runs in time
T (x, r) on input x with random bits r, then for any fixed choice of r we
have a deterministic algorithm. So for each n, we find some random X
with |X| = n and show that, for any deterministic algorithm that runs in
time T′(x), E[T′(X)] ≥ f(n). But then E[T(X, R)] = E[E[T(X, R) | R]] =
E[E[T_R(X) | R]] ≥ f(n).
This gives us Yao’s lemma:
Lemma 3.6.1 (Yao's lemma (informal version) [Yao77]). Fix some problem.
Suppose there is a random distribution on inputs X of size n such that every
deterministic algorithm for the problem has expected cost T (n).
Then the worst-case expected cost of any randomized algorithm is at least
T (n).
ω      X  Y  Z = X + Y   E[X]  E[X | Y = 3]  E[X | Y]  E[Z | X]  E[X | Z]  E[X | X]
(1, 1) 1 1 2 7/2 7/2 7/2 1 + 7/2 2/2 1
(1, 2) 1 2 3 7/2 7/2 7/2 1 + 7/2 3/2 1
(1, 3) 1 3 4 7/2 7/2 7/2 1 + 7/2 4/2 1
(1, 4) 1 4 5 7/2 7/2 7/2 1 + 7/2 5/2 1
(1, 5) 1 5 6 7/2 7/2 7/2 1 + 7/2 6/2 1
(1, 6) 1 6 7 7/2 7/2 7/2 1 + 7/2 7/2 1
(2, 1) 2 1 3 7/2 7/2 7/2 2 + 7/2 3/2 2
(2, 2) 2 2 4 7/2 7/2 7/2 2 + 7/2 4/2 2
(2, 3) 2 3 5 7/2 7/2 7/2 2 + 7/2 5/2 2
(2, 4) 2 4 6 7/2 7/2 7/2 2 + 7/2 6/2 2
(2, 5) 2 5 7 7/2 7/2 7/2 2 + 7/2 7/2 2
(2, 6) 2 6 8 7/2 7/2 7/2 2 + 7/2 8/2 2
(3, 1) 3 1 4 7/2 7/2 7/2 3 + 7/2 4/2 3
(3, 2) 3 2 5 7/2 7/2 7/2 3 + 7/2 5/2 3
(3, 3) 3 3 6 7/2 7/2 7/2 3 + 7/2 6/2 3
(3, 4) 3 4 7 7/2 7/2 7/2 3 + 7/2 7/2 3
(3, 5) 3 5 8 7/2 7/2 7/2 3 + 7/2 8/2 3
(3, 6) 3 6 9 7/2 7/2 7/2 3 + 7/2 9/2 3
(4, 1) 4 1 5 7/2 7/2 7/2 4 + 7/2 5/2 4
(4, 2) 4 2 6 7/2 7/2 7/2 4 + 7/2 6/2 4
(4, 3) 4 3 7 7/2 7/2 7/2 4 + 7/2 7/2 4
(4, 4) 4 4 8 7/2 7/2 7/2 4 + 7/2 8/2 4
(4, 5) 4 5 9 7/2 7/2 7/2 4 + 7/2 9/2 4
(4, 6) 4 6 10 7/2 7/2 7/2 4 + 7/2 10/2 4
(5, 1) 5 1 6 7/2 7/2 7/2 5 + 7/2 6/2 5
(5, 2) 5 2 7 7/2 7/2 7/2 5 + 7/2 7/2 5
(5, 3) 5 3 8 7/2 7/2 7/2 5 + 7/2 8/2 5
(5, 4) 5 4 9 7/2 7/2 7/2 5 + 7/2 9/2 5
(5, 5) 5 5 10 7/2 7/2 7/2 5 + 7/2 10/2 5
(5, 6) 5 6 11 7/2 7/2 7/2 5 + 7/2 11/2 5
(6, 1) 6 1 7 7/2 7/2 7/2 6 + 7/2 7/2 6
(6, 2) 6 2 8 7/2 7/2 7/2 6 + 7/2 8/2 6
(6, 3) 6 3 9 7/2 7/2 7/2 6 + 7/2 9/2 6
(6, 4) 6 4 10 7/2 7/2 7/2 6 + 7/2 10/2 6
(6, 5) 6 5 11 7/2 7/2 7/2 6 + 7/2 11/2 6
(6, 6) 6 6 12 7/2 7/2 7/2 6 + 7/2 12/2 6
we can calculate this out formally (no sensible person would ever do this):
    E[X | Ā] = Σ_{n=1}^{∞} n Pr[X = n | X ≠ 1]
             = Σ_{n=2}^{∞} n Pr[X = n]/Pr[X ≠ 1]
             = Σ_{n=2}^{∞} n (1 − p)^{n−1} p/(1 − p)
             = Σ_{n=2}^{∞} n (1 − p)^{n−2} p
             = Σ_{n=1}^{∞} (n + 1)(1 − p)^{n−1} p
             = 1 + Σ_{n=1}^{∞} n (1 − p)^{n−1} p
             = 1 + E[X].
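Both the identity E[X | X > 1] = 1 + E[X] and the value E[X] = 1/p from §3.6.2 are easy to sanity-check numerically (a sketch, not from the notes):

```python
import random

def geometric(p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p, trials = 0.25, 200_000
samples = [geometric(p) for _ in range(trials)]
print(sum(samples) / trials, 1 / p)      # E[X] = 1/p = 4
tail = [x for x in samples if x > 1]
print(sum(tail) / len(tail), 1 + 1 / p)  # E[X | X > 1] = 1 + E[X] = 5
```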
integer values.
means that X_i has a geometric distribution with probability (n − i + 1)/n,
giving E[X_i] = n/(n − i + 1) from the analysis in §3.6.2.
To get the total expected number of balls, take the sum
" n # n
X X
E Xi = E [Xi ]
i=1 i=1
n
X n
=
i=1
n−i+1
n
X 1
=n
i=1
i
= nHn .
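A quick simulation of the process against the nH_n formula (a sketch, not from the notes):

```python
import random

def balls_until_full(n):
    """Throw balls into n bins uniformly at random until no bin is empty;
    return the number of balls thrown."""
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        count += 1
    return count

n, trials = 50, 20_000
avg = sum(balls_until_full(n) for _ in range(trials)) / trials
harmonic = sum(1 / i for i in range(1, n + 1))
print(avg, n * harmonic)  # both close to n * H_n, about 225 for n = 50
```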
    E[Y] = E[X | X > n − X + 1] Pr[X > n − X + 1]
           + E[n − X + 1 | n − X + 1 ≥ X] Pr[n − X + 1 ≥ X].

    E[Y] ≤ ((n/2 + n − 1)/2) Pr[X > n − X + 1] + ((n/2 + n)/2) Pr[n − X + 1 ≥ X]
         = ((3/4)n − 1/2) Pr[X > n − X + 1] + (3/4)n Pr[n − X + 1 ≥ X]
         ≤ (3/4)n.
Now let Xi be the number of survivors after i pivot steps. Note that
max(0, Xi − 1) gives the number of comparisons at the following pivot step,
so that Σ_{i=0}^{∞} X_i is an upper bound on the number of comparisons.
We have X0 = n, and from the preceding argument E [X1 ] ≤ (3/4)n. But
more generally, we can use the same argument to show that E [Xi+1 | Xi ] ≤
(3/4)X_i, and by induction E[X_i] ≤ (3/4)^i n. We also have that X_j = 0 for
all j ≥ n, because we lose at least one element (the pivot) at each pivoting
step. This saves us from having to deal with an infinite sum.
Using linearity of expectation,
"∞ # " n #
X X
E Xi = E Xi
i=0 i=0
n
X
= E [Xi ]
i=0
Xn
≤ (3/4)i n
i=0
≤ 4n.
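A simulation of Hoare's FIND that counts comparisons directly shows the 4n bound is loose but safe (a sketch, not from the notes):

```python
import random

def find_comparisons(items, k):
    """Hoare's FIND: locate the k-th smallest element (0-indexed) by
    repeatedly partitioning around a random pivot; return the number
    of comparisons used."""
    comparisons = 0
    while True:
        pivot = random.choice(items)
        comparisons += len(items) - 1
        less = [x for x in items if x < pivot]
        more = [x for x in items if x > pivot]
        if k < len(less):
            items = less
        elif k == len(less):
            return comparisons  # the pivot is the answer
        else:
            items, k = more, k - len(less) - 1

n, trials = 1000, 200
avg = sum(find_comparisons(list(range(n)), n // 2) for _ in range(trials)) / trials
print(avg, 4 * n)  # the average stays well below 4n
```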
Chapter 4
Basic probabilistic
inequalities
Here we’re going to look at some inequalities useful for proving properties
of randomized algorithms. These come in two flavors: inequalities involving
probabilities, which are useful for bounding the probability that something
bad happens, and inequalities involving expectations, which are used to
bound expected running times. Later, in Chapter 5, we’ll be doing both,
by looking at inequalities that show that a random variable is close to its
expectation with high probability.
    Pr[X ≥ α] ≤ E[X]/α.

The proof is immediate from the law of total probability (2.3.3). We have

    E[X] = E[X | X ≥ α] Pr[X ≥ α] + E[X | X < α] Pr[X < α]
         ≥ α Pr[X ≥ α],

since E[X | X ≥ α] ≥ α and the second term is non-negative for X ≥ 0; dividing both sides by α gives the claim.
4.1.1 Applications
4.1.1.1 Sum of fair coins
Flip n independent fair coins, and let S be the number of heads we get. Since
E[S] = n/2, we get Pr[S = n] ≤ 1/2. This is much larger than the actual
value 2−n , but it’s the best we can hope for if we only know E [S]: if we let
S be 0 or n with equal probability, we also get E [S] = n/2.
Note that for this to work for infinitely many events we need to use the fact
that 1_{A_i} is non-negative.
If we prefer to avoid any issues with infinite sums of expectations, the
direct way to prove this is to replace A_i with B_i = A_i \ ∪_{j=1}^{i−1} A_j. Then
∪ A_i = ∪ B_i, but since the B_i are disjoint and each B_i is a subset of the
corresponding A_i, we get Pr[∪ A_i] = Pr[∪ B_i] = Σ Pr[B_i] ≤ Σ Pr[A_i].
The typical use of the union bound is to show that if an algorithm can
fail only if various improbable events occur, then the probability of failure is
no greater than the sum of the probabilities of these events. This reduces
the problem of showing that an algorithm works with probability 1 − ε to
constructing an error budget that divides the probability of failure among
all the bad outcomes.
then one of these sets must all land in the same bin. Call the event that all
balls in S choose the same bin A_S. The probability that A_S occurs is exactly
n^{−k+1}. Using the union bound, we get

    Pr[some bin gets at least k balls] = Pr[∪_S A_S]
                                      ≤ Σ_S Pr[A_S]
                                      = (n choose k) n^{−k+1}
                                      ≤ (n^k/k!) n^{−k+1}
                                      = n/k!.
If we want this probability to be low, we should choose k so that k! ≫ n.
Stirling's formula says that k! ≥ √(2πk) (k/e)^k ≥ (k/e)^k, which gives ln(k!) ≥
k(ln k − 1). If we set k = c ln n/ ln ln n, we get

    ln(k!) ≥ (c ln n/ln ln n)(ln c + ln ln n − ln ln ln n − 1)
           ≥ c ln n.
4.3.1 Proof
Here is a proof for the case that f is convex and differentiable. The idea is
that if f is convex, then it lies above the tangent line at E [X]. So we can
define a linear function g that represents this tangent line, and get, for all x:

    g(x) = f(E[X]) + (x − E[X]) f′(E[X]) ≤ f(x).

But then

    E[f(X)] ≥ E[g(X)]
            = E[f(E[X]) + (X − E[X]) f′(E[X])]
            = f(E[X]) + f′(E[X]) E[X − E[X]]
            = f(E[X]).

Figure 4.1 shows what this looks like for a particular convex f.
This is pretty much all linearity of expectation in action: E [X], f (E [X]),
and f′(E[X]) are all constants, so we can pull them out of expectations
whenever we see them.
The proof of the general case is similar, but for a non-differentiable convex
function it takes a bit more work to show that the bounding linear function
g exists.
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 56
[Figure 4.1: A convex function f(x) lying above its tangent line g(x) at E[X]; the vertical axis marks f(E[X]) ≤ E[f(X)].]
4.3.2 Applications
4.3.2.1 Fair coins: lower bound
Suppose we flip n fair coins, and we want to get a lower bound on E[X²],
worse bound than the 1/2 we can get from applying Markov’s inequality to
X directly.
4.3.2.3 Sifters
Here’s an example of Jensen’s inequality in action in the analysis of an actual
distributed algorithm. For some problems in distributed computing, it’s
useful to reduce coordinating a large number of processes to coordinating
a smaller number. A sifter [AA11] is a randomized mechanism for an
asynchronous shared-memory system that sorts the processes into “winners”
and “losers,” guaranteeing that there is at least one winner. The goal is to
make the expected number of winners as small as possible. The problem
is tricky, because processes can only communicate by reading and writing
shared variables, and an adversary gets to choose which processes participate
and fix the schedule of when each of these processes perform their operations.
The current best known sifter is due to Giakkoupis and Woelfel [GW12].
For n processes, it uses an array A of ⌈lg n⌉ bits, each of which can be read
or written by any of the processes. When a process executes the sifter, it
chooses a random index r ∈ 1 . . . ⌈lg n⌉ with probability 2^{−r} (this doesn't
exactly sum to 1, so the excess probability gets added to r = ⌈lg n⌉). The
process then writes a 1 to A[r] and reads A[r + 1]. If it sees a 0 in its read
(or chooses r = ⌈lg n⌉), it wins; otherwise it loses.
This works as a sifter, because no matter how many processes participate,
some process chooses a value of r at least as large as any other process’s
value, and this process wins. To bound the expected number of winners, take
the sum over all r of the random variable W_r representing the winners
who chose this particular value r. A process that chooses r wins if it carries
out its read operation before any process writes r + 1. If the adversary wants
to maximize the number of winners, it should let each process read as soon
as possible; this effectively means that a process that chooses r wins if no
process previously chooses r + 1. Since r is twice as likely to be chosen as
r + 1, conditioning on a process picking r or r + 1, there is only a 1/3 chance
that it chooses r + 1. So at most 1/(1/3) − 1 = 2 = O(1) processes on average
choose r before some process chooses r + 1. (A simpler argument shows that
the expected number of processes that win because they choose r = ⌈lg n⌉ is
at most 2 as well.)
Summing E[W_r] ≤ 2 over all r gives at most 2⌈lg n⌉ winners on average.
Furthermore, if k < n processes participate, essentially the same analysis
shows that only 2⌈lg k⌉ processes win on average. So this is a pretty effective
tool for getting rid of excess processes.
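Below is a simulation sketch of this sifter (not from the notes; the schedule in which every process reads immediately after its write is one convenient choice, roughly matching the adversary strategy used in the analysis):

```python
import math
import random

def choose_index(m):
    """Pick r in 1..m with Pr[r] = 2**-r, the excess probability going to r = m."""
    r = 1
    while r < m and random.random() < 0.5:
        r += 1
    return r

def sifter_winners(n):
    """Run the sifter once with n processes, each writing A[r] and then
    immediately reading A[r+1]; return the number of winners."""
    m = max(1, math.ceil(math.log2(n)))
    written = [False] * (m + 2)  # written[r]: has anyone written A[r] yet?
    winners = 0
    for _ in range(n):
        r = choose_index(m)
        written[r] = True
        if r == m or not written[r + 1]:
            winners += 1
    return winners

n, trials = 10_000, 100
avg = sum(sifter_winners(n) for _ in range(trials)) / trials
print(avg, 2 * math.ceil(math.log2(n)))  # average winners vs the 2*ceil(lg n) bound
```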
But it gets better. Suppose that we take the winners of one sifter and
feed them into a second sifter. Let X_k be the number of processes left after
k sifters. We have that X_0 = n and E[X_1] ≤ 2⌈lg n⌉, but what can we
say about E[X_2]? We can calculate E[X_2] = E[E[X_2 | X_1]] ≤ E[2⌈lg X_1⌉].
Unfortunately, the ceiling means that 2⌈lg x⌉ is not a concave function,
but f(x) = 2(lg x + 1) ≥ 2⌈lg x⌉ is. So E[X_2] ≤ f(f(n)), and in general
E[X_i] ≤ f^{(i)}(n), where f^{(i)} is the i-fold composition of f. All the extra
constants obscure what is going on a bit, but with a little bit of algebra it is
not too hard to show that f^{(i)}(n) = O(1) for i = O(log* n).[4] So this gets rid
of all but a constant number of processes very quickly.
[4] The log* function counts how many times you need to hit n with lg to reduce it to
one or less. So log* 1 = 0, log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4,
log* 2^65536 = 5, and after that it starts getting silly.
Chapter 5
Concentration bounds
If we really want to get tight bounds on a random variable X, the trick will
turn out to be picking some non-negative function f (X) where (a) we can
calculate E [f (X)], and (b) f grows fast enough that merely large values of
X produce huge values of f (X), allowing us to get small probability bounds
by applying Markov’s inequality to f (X). This approach is often used to
show that X lies close to E [X] with reasonably high probability, what is
known as a concentration bound.
Typically concentration bounds are applied to sums of random variables,
which may or may not be fully independent. Which bound you may want to
use often depends on the structure of your sum. A quick summary of the
bounds in this chapter is given in Table 5.1. The rule of thumb is to use
Chernoff bounds (§5.2) if you have a sum of independent 0–1 random variables;
the Azuma-Hoeffding inequality (§5.3) if you have bounded variables with a
more complicated distribution that may be less independent; and Chebyshev’s
inequality (§5.1) if nothing else works but you can somehow compute the
variance of your sum (e.g., if the Xi are independent or have easily computed
covariance). In the case of Chernoff bounds, you will almost always end up
using one of the weaker but cleaner versions in §5.2.2 rather than the general
version in §5.2.1.
If none of these bounds work for your particular application, there are
many more out there. See for example the textbook by Dubhashi and
Panconesi [DP09].
Chernoff           X_i ∈ {0, 1}, independent   Pr[S ≥ (1 + δ) E[S]] ≤ (e^δ/(1 + δ)^{1+δ})^{E[S]}
Azuma-Hoeffding    |X_i| ≤ c_i, martingale     Pr[S ≥ t] ≤ exp(−t²/(2 Σ c_i²))
Chebyshev                                      Pr[|S − E[S]| ≥ α] ≤ Var[S]/α²

Table 5.1: Concentration bounds for S = Σ X_i (strongest to weakest)
Expand

    E[(X − E[X])²] = E[X² − 2X · E[X] + (E[X])²]
                   = E[X²] − 2 E[X] · E[X] + (E[X])²
                   = E[X²] − (E[X])².   (5.1.2)
This formula is easier to use if you are estimating the variance from a
sequence of samples; by tracking Σ x_i² and Σ x_i, you can estimate E[X²]
and E[X] in a single pass, without having to estimate E[X] first and then go
back for a second pass to calculate (x_i − E[X])² for each sample. We won't
use this particular application much, but this explains why the formula is
popular with statisticians.
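The one-pass estimate mentioned here can be written out directly (a sketch, not from the notes):

```python
def one_pass_variance(samples):
    """Estimate Var[X] = E[X^2] - (E[X])^2 in a single pass, tracking only
    the running sums of x and x**2."""
    n = sum_x = sum_x2 = 0
    for x in samples:
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    return sum_x2 / n - mean * mean

print(one_pass_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```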
For any two random variables X and Y , the quantity E [XY ]−E [X] E [Y ]
is called the covariance of X and Y , written Cov [X, Y ]. If we take the
covariance of a variable and itself, covariance becomes variance: Cov [X, X] =
Var [X].
We can use Cov [X, Y ] to rewrite the above expansion as
" #
X X
Var Xi = Cov [Xi , Xj ] (5.1.3)
i i,j
X X
= Var [Xi ] + Cov [Xi , Xj ] (5.1.4)
i i6=j
X X
= Var [Xi ] + 2 Cov [Xi , Xj ] (5.1.5)
i i<j
Note that Cov [X, Y ] = 0 when X and Y are independent; this makes
Chebyshev’s inequality particularly useful for pairwise-independent ran-
dom variables, because then we can just sum up the variances of the individual
variables.
A typical application is when we have a sum S = Σ_i X_i of non-negative
random variables with small covariance; here applying Chebyshev’s inequality
to S can often be used to show that S is not likely to be much smaller than
E [S], which can be handy if we want to show that some lower bound holds
on S with some probability. This complements Markov’s inequality, which
can only be used to get upper bounds.
For example, suppose S = Σ_{i=1}^{n} X_i, where the X_i are independent
Bernoulli random variables with E[X_i] = p for all i. Then E[S] = np, and
Var[S] = Σ_i Var[X_i] = npq (because the X_i are independent). Chebyshev's
inequality then says

    Pr[|S − E[S]| ≥ α] ≤ npq/α².

The highest variance is when p = 1/2. In this case, the probability that
S is more than β√n away from its expected value n/2 is bounded by 1/(4β²).
We'll see better bounds on this problem later, but this may already be good
enough for many purposes.
More generally, the approach of bounding S from below by estimating
E[S] and either E[S²] or Var[S] is known as the second-moment method.
In some cases, tighter bounds can be obtained by more careful analysis.
In some cases, tighter bounds can be obtained by more careful analysis.
that once we have flipped one coin the wrong way, we are back where
we started, except that now we have to add that extra coin to X. More
formally, we have, for n > 1, Pr[X = n | X > 1] = Pr[X = n]/Pr[X > 1] =
q^{n−1}p/q = q^{n−2}p = Pr[X = n − 1], and

    Var[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} Var[X_i]
                        = Σ_{i=1}^{n} n(i − 1)/(n − i + 1)².
or more balls in a particular bin is at most (m/n − m/n²)/k² < m/(nk²), and
applying the union bound, the probability that we get k + m/n or more balls
in any of the n bins is less than m/k². Setting this equal to ε and solving for
k gives a probability of at most ε of getting more than m/n + √(m/ε) balls
in any of the bins. This is not as good a bound as we will be able to prove
later, but it's at least non-trivial.
2. Use our sample to find an interval that is likely to contain S_(k). The
idea is to pick indices ℓ = (k − n^{3/4})n^{−1/4} and r = (k + n^{3/4})n^{−1/4} and
use R_(ℓ) and R_(r) as endpoints (we are omitting some floors and maxes
here to simplify the notation; for a more rigorous presentation see
[MR95]). The hope is that the interval P = [R_(ℓ), R_(r)] in S will both
contain S_(k), and be small, with |P| ≤ 4n^{3/4} + 2. We can compute the
elements of P in 2n comparisons exactly by comparing every element
with both R_(ℓ) and R_(r).
(with the caveat that we are being sloppy about round-off errors).
Similarly,

    Pr[R_(h) < S_(k)] = Pr[X > kn^{−1/4} + √n]
                     ≤ n^{−1/4}/4.
    E[e^{αS}] = Π_i E[e^{αX_i}]
              = Π_i (p_i e^α + (1 − p_i) e^0)
              = Π_i (p_i e^α + 1 − p_i)
              = Π_i (1 + (e^α − 1) p_i)
              ≤ Π_i e^{(e^α − 1) p_i}
              = e^{(e^α − 1) Σ_i p_i}
              = e^{(e^α − 1)µ}.
The sneaky inequality step in the middle uses the fact that 1 + x ≤ e^x
for all x, which itself is one of the most useful inequalities you can memorize.[5]
What's nice about this derivation is that at the end, the p_i have vanished.
We don't care what random variables we started with or how many of them
there were, but only about their expected sum µ.
Now that we have an upper bound on E[e^{αS}], we can throw it into
[5] For a proof of this inequality, observe that the function f(x) = e^x − (1 + x) has the
derivative e^x − 1, which is positive for x > 0 and negative for x < 0. It follows that x = 0
is the unique minimum of f, at which f(0) = 0.
    Pr[S ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ
    Pr[S ≥ (1 + δ)µ] ≤ e^{−δ²µ/3}    for 0 ≤ δ ≤ 1.81
    Pr[S ≥ (1 + δ)µ] ≤ e^{−δ²µ/4}    for 0 ≤ δ ≤ 4.11
    Pr[S ≥ R] ≤ 2^{−R}               for R/µ ≥ 4.32

    Pr[|S − µ| ≥ δµ] ≤ 2e^{−δ²µ/3}   for 0 ≤ δ ≤ 1.81.   (5.2.7)
Suppose that we want this bound to be less than ε. Then we need
2e^{−δ²µ/3} ≤ ε, or δ ≥ √(3 ln(2/ε)/µ). Setting δ to exactly this quantity, (5.2.7)
becomes

    Pr[|S − µ| ≥ √(3µ ln(2/ε))] ≤ ε,   (5.2.8)

provided ε ≥ 2e^{−µ/3}.
For asymptotic purposes, we can omit the constants, giving
    Pr[S ≥ (1 + δ)µ] ≤ Pr[S + T ≥ (1 + δ)µ]
                     ≤ (e^δ/(1 + δ)^{1+δ})^µ.
This also works for any of the bounds in §5.2.2 that are derived from
(5.2.1).
In the other direction, if we know E [S] ≥ µ and want to apply the lower
tail bound (5.2.5), we can apply a slightly different construction. Suppose
that E[S] = µ′ ≥ µ. For each X_i, construct a 0-1 random variable Y_i such
that (a) all the Y_i are independent of each other, (b) Y_i ≤ X_i always, and (c)
E[Y_i] = E[X_i](µ/µ′). The easiest way to do this is to set Y_i = X_i Z_i where
each Z_i is an independent biased coin with expectation µ/µ′.
Let T = Σ Y_i. Then T ≤ S and E[T] = Σ E[Y_i] = Σ E[X_i](µ/µ′) = µ.
    Pr[S ≤ (1 − δ)µ] ≤ Pr[T ≤ (1 − δ)µ]
                     ≤ (e^{−δ}/(1 − δ)^{1−δ})^µ.
As with the upper tail bound, this approach also works for simplified
versions of the lower tail bound like (5.2.6).
For the two-sided variants, we are out of luck. The best we can do if we
know a ≤ E [S] ≤ b is to apply each of the one-sided bounds separately.
Proof. Rather than repeat the argument for independent variables, we will
employ a coupling, where we replace the X_i with independent Y_i so that
Σ_{i=1}^{n} Y_i gives a bound on Σ_{i=1}^{n} X_i.
For the upper bound, let each Y_i = 1 with independent probability
p_i. Use the following process to generate a new X′_i in increasing order
of i: if Y_i = 0, set X′_i = 0. Otherwise set X′_i = 1 with probability
Pr[X_i = 1 | X_1 = X′_1, . . . , X_{i−1} = X′_{i−1}]/p_i. Then X′_i ≤ Y_i, and

    Pr[X′_i = 1 | X′_1, . . . , X′_{i−1}] = p_i · (Pr[X_i = 1 | X_1 = X′_1, . . . , X_{i−1} = X′_{i−1}]/p_i)
                                        = Pr[X_i = 1 | X_1 = X′_1, . . . , X_{i−1} = X′_{i−1}].

It follows that the X′_i have the same joint distribution as the X_i, and so
" n # " n #
Xi0
X X
Pr ≥ µ(1 + δ) = Pr Xi ≥ µ(1 + δ)
i=1 i=1
" n #
X
≤ Pr Yi ≥ µ(1 + δ)
i=1
!µ
eδ
≤ .
(1 + δ)1+δ
For the other direction, generate the Xi first and generate the Yi using
the same rejection sampling trick. Now the Yi are independent (because
their joint distribution is) and each Yi is a lower bound on the corresponding
Xi .
The lemma is stated for the general Chernoff bounds (5.2.1) and (5.2.5),
but the easier versions follow from these, so they hold as well, as long as we
are careful to remember that the µ in the upper bound is not necessarily the
same µ as in the lower bound.
5.2.8 Applications
5.2.8.1 Flipping coins
Suppose S is the sum of n independent fair coin-flips. Then E[S] = n/2 and
Pr[S = n] = Pr[S ≥ 2 E[S]] is bounded using (5.2.1) by setting µ = n/2,
δ = 1 to get Pr[S = n] ≤ (e/4)^{n/2} = (2/√e)^{−n}. This is not quite as good as
the real answer 2^{−n} (the quantity 2/√e is about 1.213. . . ), but it's at least
exponentially small.
[Figure: the three-dimensional hypercube, with each vertex labeled by a 3-bit string and edges joining strings that differ in one bit.]
The actual excess over the mean is δ(n/2) = (n/2)√(6c ln n/n) = √((3/2)c n ln n).
the n edges leaving a processor.[6] How do we route the packets so that all of
them arrive in the minimum amount of time?
We could try to be smart about this, or we could use randomization.
Valiant’s idea is to first route each process i’s packet to some random
intermediate destination σ(i), then in the second phase, we route it from σ(i)
to its ultimate destination π(i). Unlike π, σ is not necessarily a permutation;
instead, σ(i) is chosen uniformly at random independently of all the other
σ(j). This makes the choice of paths for different packets independent of
each other, which we will need later to apply Chernoff bounds.
Routing is done by bit-fixing: if a packet is currently at node x and
heading for node y, find the leftmost bit j where x_j ≠ y_j and fix it, by
sending the packet on to x[x_j/y_j]. In the absence of contention, bit-fixing
routes a packet to its destination in at most n steps. The hope is that the
randomization will tend to spread the packets evenly across the network,
reducing the contention for edges enough that the actual time will not be
much more than this.
The first step is to argue that, during the first phase, any particular
packet is delayed at most one time unit by any other packet whose path
overlaps with it. Suppose packet i is delayed by contention on some edge uv.
Then there must be some other packet j that crosses uv during this phase.
From this point on, j remains one step ahead of i (until its path diverges),
so it can’t block i again unless both are blocked by some third packet k (in
which case we charge i’s further delay to k). This means that we can bound
the delays for packet i by counting how many other packets cross its path.7
So now we just need a high-probability bound on the number of packets that
get in a particular packet’s way.
Following the presentation in [MR95], define $H_{ij}$ to be the indicator variable for the event that packets $i$ and $j$ cross paths during the first phase. Because each $j$ chooses its destination independently, once we fix $i$'s path, the $H_{ij}$ are all independent. So we can bound $S = \sum_{j\ne i} H_{ij}$ using Chernoff bounds. To do so, we must first calculate an upper bound on $\mu = E[S]$.
The trick here is to observe that any path that crosses i’s path must cross
one of its edges, and we can bound the number of such paths by bounding
how many paths cross each edge. For each edge $e$, let $T_e$ be the number of paths that cross edge $e$, and for each $j$, let $X_j$ be the number of edges that path $j$ crosses. Counting two ways, we have $\sum_e T_e = \sum_j X_j$, and so
6 Formally, we have a synchronous routing model with unbounded buffers at each node, with a maximum capacity of one packet per edge per round.
7 A much more formal version of this argument is given as [MR95, Lemma 4.5].
$E[\sum_e T_e] = E[\sum_j X_j] \le N(n/2)$. By symmetry, all the $T_e$ have the same expectation, so we get $E[T_e] \le \frac{N(n/2)}{Nn} = 1/2$.
Now fix $\sigma(i)$. This determines some path $e_1 e_2 \dots e_k$ for packet $i$. In general we do not expect $E[T_{e_\ell} \mid \sigma(i)]$ to equal $E[T_{e_\ell}]$, because conditioning on $i$'s path crossing $e_\ell$ guarantees that at least one path crosses this edge that might not have. However, if we let $T'_e$ be the number of packets $j \ne i$ that cross $e$, then we have $T'_e \le T_e$ always, giving $E[T'_e] \le E[T_e]$, and because $T'_e$ does not depend on $i$'s path, $E[T'_e \mid \sigma(i)] = E[T'_e] \le E[T_e] \le 1/2$. Summing this bound over all $k \le n$ edges on $i$'s path gives $E[\sum_{j\ne i} H_{ij} \mid \sigma(i)] \le n/2$, which implies $E[\sum_{j\ne i} H_{ij}] \le n/2$ after removing the conditioning on $\sigma(i)$.
Inequality (5.2.4) says that $\Pr[X \ge R] \le 2^{-R}$ when $R \ge 2e\mu$. Letting $X = \sum_{j\ne i} H_{ij}$ and setting $R = 3n$ gives $R = 6(n/2) \ge 6\mu > 2e\mu$, so $\Pr[\sum_{j\ne i} H_{ij} \ge 3n] \le 2^{-3n} = N^{-3}$. This says that any one packet reaches its random destination with at most $3n$ added delay (thus, in at most $4n$ time units) with probability at least $1 - N^{-3}$. If we consider all $N$ packets, the total probability that any of them fail to reach their random destinations in $4n$ time units is at most $N \cdot N^{-3} = N^{-2}$. Note that because we are using the union bound, we don't need independence for this step—which is good, because we don't have it.
What about the second phase? Here, routing the packets from the random
destinations back to the real destinations is just the reverse of routing them
from the real destinations to the random destinations. So the same bound
applies, and with probability at most $N^{-2}$ some packet takes more than
4n time units to get back (this assumes that we hold all the packets before
sending them back out, so there are no collisions between packets from
different phases).
Adding up the failure probabilities and costs for both stages gives a
probability of at most 2/N 2 that any packet takes more than 8n time units
to reach its destination.
The structure of this argument is pretty typical for applications of Chernoff bounds: we get a very small bound on the probability that something
bad happens by applying Chernoff bounds to a part of the problem where
we have independence, then use the union bound to extend this to the full
problem where we don’t.
Proof. The basic idea is that, for any $\alpha$, $e^{\alpha x}$ is a convex function. Since we want an upper bound, we can't use Jensen's inequality (4.3.1), but we can use the fact that $X$ is bounded and we know its expectation. Convexity of $e^{\alpha x}$ means that, for any $x$ with $-c \le x \le c$, $e^{\alpha x} \le \lambda e^{-\alpha c} + (1-\lambda)e^{\alpha c}$, where $x = \lambda(-c) + (1-\lambda)c$. Solving for $\lambda$ in terms of $x$ gives $\lambda = \frac{1}{2}\left(1 - \frac{x}{c}\right)$ and $1 - \lambda = \frac{1}{2}\left(1 + \frac{x}{c}\right)$. So
$$\begin{aligned}
E\left[e^{\alpha X}\right] &\le E\left[\frac{1}{2}\left(1 - \frac{X}{c}\right)e^{-\alpha c} + \frac{1}{2}\left(1 + \frac{X}{c}\right)e^{\alpha c}\right] \\
&= \frac{e^{-\alpha c} + e^{\alpha c}}{2} - \frac{e^{-\alpha c}}{2c}E[X] + \frac{e^{\alpha c}}{2c}E[X] \\
&= \frac{e^{-\alpha c} + e^{\alpha c}}{2} \qquad\text{(using $E[X] = 0$)} \\
&= \cosh(\alpha c).\end{aligned}$$
8 Note that the requirement that $E[X_i] = 0$ can always be satisfied by considering instead $Y_i = X_i - E[X_i]$.
9 The history of this is that Hoeffding [Hoe63] proved it for independent random variables, and observed that the proof was easily extended to martingales, while Azuma [Azu67] actually went and did the work of proving it for martingales.
In other words, the worst possible X is a fair choice between ±c, and
in this case we get the hyperbolic cosine of αc as its moment generating
function.
We don’t like hyperbolic cosines much, because we are going to want
to take products of our bounds, and hyperbolic cosines don’t multiply very
nicely. As before with 1 + x, we’d be much happier if we could replace the
cosh with a nice exponential. The Taylor series expansion of $\cosh x$ starts with $1 + x^2/2 + \dots$, suggesting that we should approximate it with $\exp(x^2/2)$, and indeed it is the case that for all $x$, $\cosh x \le e^{x^2/2}$. This can be shown by comparing the rest of the Taylor series expansions:
$$\begin{aligned}
\cosh x &= \frac{e^x + e^{-x}}{2} \\
&= \frac{1}{2}\left(\sum_{n=0}^{\infty}\frac{x^n}{n!} + \sum_{n=0}^{\infty}\frac{(-x)^n}{n!}\right) \\
&= \sum_{n=0}^{\infty}\frac{x^{2n}}{(2n)!} \\
&\le \sum_{n=0}^{\infty}\frac{x^{2n}}{2^n n!} \\
&= \sum_{n=0}^{\infty}\frac{(x^2/2)^n}{n!} \\
&= e^{x^2/2}.\end{aligned}$$
Theorem 5.3.2. Let $X_1 \dots X_n$ be independent random variables with $E[X_i] = 0$ and $|X_i| \le c_i$ for all $i$. Then for all $t$,
$$\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.2}$$
To make the problem fit the theorem, we replace each $X_i$ by a rescaled version $Y_i = 2X_i - 1 = \pm 1$ with equal probability; this makes $E[Y_i] = 0$ as needed,
2. $E[S_{t+1} \mid \mathcal{F}_t] = S_t$.
This means that even if we include any extra information we might have
at time t, we still can’t predict St+1 any better than by guessing the current
value St . This alternative definition will be important in some special cases,
as when St is a function of some other collection of random variables that
we use to define the Ft . Because Ft includes at least as much information as
S0 , . . . , St , it will always be the case that any sequence {(St , Ft )} that is a
martingale in the general sense gives a sequence {St } that is a martingale in
the more specialized E [St+1 | S0 , . . . , St ] = St sense.
Martingales were invented to analyze fair gambling games, where your
return over some time interval is not independent of previous outcomes (for
example, you may change your bet or what game you are playing depending
on how things have been going for you), but it is always zero on average
given previous information.10 The nice thing about martingales is they allow
10 Real casinos give negative expected return, so your winnings in a real casino form a supermartingale with $S_t \ge E[S_{t+1} \mid S_0 \dots S_t]$. On the other hand, the casino's take, in a well-run casino, is a submartingale, a process with $S_t \le E[S_{t+1} \mid S_0 \dots S_t]$. These definitions also generalize in the obvious way to the $\{(S_t, \mathcal{F}_t)\}$ case.
for a bit of dependence while still acting very much like sums of independent
random variables.
Where this comes up with Hoeffding’s inequality is that we might have a
process that is reasonably well-behaved, but its increments are not technically
independent. For example, suppose that a gambler plays a game where
they bet $x$ units, $0 \le x \le 1$, at each round, and receive $\pm x$ with equal probability. Suppose also that their bet at each round may depend on the outcome of previous rounds (for example, they might stop betting entirely if they lose too much money). If $X_i$ is their take at round $i$, we have that $E[X_i \mid X_1 \dots X_{i-1}] = 0$ and that $|X_i| \le 1$. This is enough to apply the
martingale version of Hoeffding’s inequality, often called Azuma’s inequality.
Theorem 5.3.3. Let $\{S_k\}$ be a martingale with $S_k = \sum_{i=1}^k X_i$ and $|X_i| \le c_i$ for all $i$. Then for all $n$ and all $t \ge 0$:
$$\Pr[S_n \ge t] \le \exp\left(\frac{-t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.5}$$
Proof. Basically, we just show that $E[e^{\alpha S_n}] \le \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n c_i^2\right)$—just like in the proof of Theorem 5.3.2—and the rest follows using the same argument. The only tricky part is that we can no longer use independence to transform $E\left[\prod_{i=1}^n e^{\alpha X_i}\right]$ into $\prod_{i=1}^n E\left[e^{\alpha X_i}\right]$.
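To illustrate, here is a small simulation sketch of the gambler example above, checking the empirical tail frequency against the Azuma bound; the particular stopping rule (quit after losing 10 units) is a hypothetical choice for illustration:

import random
from math import exp

def gamble(rounds: int, rng: random.Random) -> float:
    # Hypothetical strategy: bet 1 unit per round until 10 units down,
    # then stop betting. Any rule depending only on the past keeps S_t a
    # martingale with increments |X_i| <= 1.
    s = 0.0
    for _ in range(rounds):
        if s > -10:
            s += rng.choice((-1.0, 1.0))
    return s

rng = random.Random(2)
n, t = 1000, 100
runs = [gamble(n, rng) for _ in range(2000)]
freq = sum(s >= t for s in runs) / len(runs)
print(f"empirical Pr[S_n >= {t}] = {freq:.4f}, "
      f"Azuma bound = {exp(-t * t / (2 * n)):.4f}")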
Some extensions:
• The asymmetric version of Hoeffding's inequality (5.3.4) also holds for martingales. So if each increment $X_i$ satisfies $a_i \le X_i \le b_i$ always,
$$\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \tag{5.3.6}$$
11 The corresponding notion in the other direction is a submartingale. See §9.2.
12 This is known as a Doob decomposition and can be used to extract a martingale $\{Z_i\}$ from any stochastic process $\{X_i\}$. For general processes, $Y_i = X_i - Z_i$ will still be predictable, but may not satisfy $E[Y_i \mid X_1, \dots, X_{i-1}] \le 0$.
• Suppose that we stop the process after the first time $\tau$ with $S_\tau = \sum_{i=1}^{\tau} X_i \ge t$. This is equivalent to making a new variable $Y_i$ that is zero whenever $S_{i-1} \ge t$ and equal to $X_i$ otherwise. This doesn't affect the conditions $E[Y_i \mid Y_1 \dots Y_{i-1}] = 0$ or $|Y_i| \le c_i$, but it makes it so that $\sum_{i=1}^n Y_i \ge t$ if and only if $\max_{k\le n}\sum_{i=1}^k X_i \ge t$. Applying (5.3.5) to $\sum Y_i$ then gives
$$\Pr\left[\max_{k\le n}\sum_{i=1}^k X_i \ge t\right] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.8}$$
There are also cases where the asymmetric version works with ai ≤
Xi ≤ bi where a bound on bi − ai is fixed but the precise values of
ai and bi may vary depending on X1 , . . . , Xi−1 . This shows up in the
proof of McDiarmid’s inequality [McD89], which is described below in
§5.3.3.
that $f$ has the bounded difference property, which says that there are bounds $c_t$ such that for any $x_1 \dots x_n$ and any $x'_t$, we have
$$\left|f(x_1, \dots, x_t, \dots, x_n) - f(x_1, \dots, x'_t, \dots, x_n)\right| \le c_t.$$
When $x_{t+1}$ happens to be the value of $X_{t+1}$, the first conditional expectation in (5.3.12) is just $Y_{t+1}$, giving
$$|Y_{t+1} - Y_t| \le c_{t+1}.$$
This turns out to overestimate the possible range of Yt+1 . With a more
sophisticated argument, it can be shown that for any fixed x1 , . . . , xt , there
exist bounds at+1 ≤ Yt+1 − Yt ≤ bt+1 such that bt+1 − at+1 = ct+1 . We
would like to use this to apply the asymmetric version of Azuma-Hoeffding
given in (5.3.6). A complication is that the specific values of at+1 and bt+1
may depend on the previous values x1 , . . . , xt , even if the bound ct+1 on
their maximum difference does not. Fortunately, McDiarmid shows that the
inequality works anyway, giving:
5.3.4 Applications
Here are some applications of the preceding inequalities. Most of these are
examples of the method of bounded differences.
location, how likely is it that your distance to the nearest mailbox deviates
substantially from the average distance?
We can describe your position as a bit vector X1 , . . . , Xn , where each
Xi is an independent random bit. Let f (X1 , . . . , Xn ) be the distance from
X1 , . . . , Xn to the nearest element of A. Then changing one of the bits
changes this function by at most 1. So we have $\Pr[|f - E[f]| \ge t] \le 2e^{-2t^2/n}$ by (5.3.13), giving a range of possible distances that is $O(\sqrt{n\log n})$ with probability at least $1 - n^{-c}$ for any fixed $c > 0$.14 Of course, without knowing
what A is, we don’t know what E[f ] is; but at least we can be assured that
(unless A is very big) the distance we have to walk to send our mail will be
pretty much the same pretty much wherever we start.
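Here is a small simulation sketch of this example (the particular values of $n$ and $|A|$ are arbitrary choices for illustration):

import random

def f(x: int, A: list[int]) -> int:
    # Hamming distance from the bit vector x to the nearest element of A.
    return min(bin(x ^ a).count("1") for a in A)

# Hypothetical example: n = 24 bit positions and 5 random mailboxes.
n, rng = 24, random.Random(1)
A = [rng.getrandbits(n) for _ in range(5)]
samples = [f(rng.getrandbits(n), A) for _ in range(20000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
# Changing one bit changes f by at most 1, so the bounded-difference
# bound predicts a spread of O(sqrt(n)); sqrt(n)/2 is about 2.45 here.
print(f"mean distance {mean:.2f}, standard deviation {std:.2f}")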
of n and p than are given by this crude result. For a more recent paper that
cites many of these see [Hec21].
$$T(n) = (n-1) + \frac{1}{n}\sum_{k=0}^{n-1}\left(T(k) + T(n-1-k)\right).$$
so far and the summation gives an upper bound on the expected number of
comparisons remaining.
To show that this is in fact a supermartingale, observe that if we partition
a block of size n we add n − 1 to Ct but replace the cost bound 2n ln n by
an expected
$$2\cdot\frac{1}{n}\sum_{k=0}^{n-1} 2k\ln k \le \frac{4}{n}\int_2^n k\ln k\,dk = \frac{4}{n}\left(\frac{n^2\ln n}{2} - \frac{n^2}{4} - \ln 2 + 1\right) = 2n\ln n - n - \ln 2 + 1 < 2n\ln n - n.$$
The net change is less than − ln 2. The fact that it’s not zero suggests
that we could improve the 2n ln n bound slightly, but since it’s going down,
we have a supermartingale.
Let's try to get a bound on how much $X_t$ changes at each step. The $C_t$ part goes up by at most $n-1$. The summation can only go down; if we split a block of size $n_i$, the biggest drop we get is if we split it evenly.15 This gives a drop of
$$\begin{aligned}
2n\ln n - 2\cdot 2\cdot\frac{n-1}{2}\ln\frac{n-1}{2} &= 2n\ln n - 2(n-1)\ln\frac{n-1}{2} \\
&= 2n\ln n - 2(n-1)\left(\ln n - \ln\frac{2n}{n-1}\right) \\
&= 2n\ln n - 2n\ln n + 2n\ln\frac{2n}{n-1} + 2\ln n - 2\ln\frac{2n}{n-1} \\
&= 2n\cdot O(1) + O(\log n) \\
&= O(n).\end{aligned}$$
analysis of Hoare's FIND (§3.6.4). For each element, the number of elements in the same block is multiplied by a factor of at most 3/4 on average each time the element is compared, so the chance that the element is not by itself is at most $(3/4)^k n$ after $k$ comparisons. Setting $k = \log_{4/3}(n^2/\epsilon)$ gives that any particular element is compared $k$ or more times with probability at most $\epsilon/n$. The union bound then gives a probability of at most $\epsilon$ that the most-compared element is compared $k$ or more times. So the total number of comparisons is $O(\log(n/\epsilon))$ with probability $1 - \epsilon$, which becomes $O(\log n)$ with probability $1 - n^{-c}$ if we set $\epsilon = n^{-c}$ for a fixed $c$.
We can formalize this argument using a supermartingale. Let $C_i^t$ be the number of times $i$ has been compared as a non-pivot in the first $t$ pivot steps and $N_i^t$ be the size of the block containing $i$ after $t$ pivot steps. Let $Y_i^t = (4/3)^{C_i^t} N_i^t$. Then if we pick $i$'s block at step $t+1$, the exponent goes up by at most 1 and $N_i^t$ drops by a factor of 3/4 on average, canceling out the increase. If we don't pick $i$'s block, $Y_i^t$ is unchanged. In either case we get $Y_i^t \ge E[Y_i^{t+1} \mid \mathcal{F}_t]$ and $Y_i^t$ is a supermartingale.
h i
Now let Z t = i Yit . Since this is greater than or equal to i E Yit+1 Ft =
P P
E [Z n ]
≤
na
Z 0
≤ a
n
= na−2 .
h i
Choosing a = c+2 gives an n−c bound on Pr maxi Cin ≥ (c + 2) log4/3 n
hP i
n
and thus the same bound on Pr i Ci ≥ (c + 2)n log4/3 n . This shows that
the total number of comparisons is O(n log n) with high probability.
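A quick simulation sketch of this bound (our own illustration): count how often each element is compared as a non-pivot in randomized QuickSort and compare the maximum against the $(c+2)\log_{4/3} n$ scale with $c = 0$:

import random
from math import log

def max_comparisons(n: int, rng: random.Random) -> int:
    # Run randomized QuickSort on 0..n-1, counting for each element how
    # many times it is compared as a non-pivot; return the maximum count.
    counts = [0] * n
    stack = [list(range(n))]
    while stack:
        block = stack.pop()
        if len(block) <= 1:
            continue
        pivot = rng.choice(block)
        for x in block:
            if x != pivot:
                counts[x] += 1
        stack.append([x for x in block if x < pivot])
        stack.append([x for x in block if x > pivot])
    return max(counts)

rng = random.Random(4)
n = 1 << 12
worst = max(max_comparisons(n, rng) for _ in range(20))
print(worst, "vs 2 log_{4/3} n =", round(2 * log(n) / log(4 / 3)))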
$$\sum_i T_i\cdot(\mu^* - \mu_i), \tag{5.3.15}$$
where $T_i$ counts the number of times we pull arm $i$, $\mu^*$ is the expected payoff of the best arm, and $\mu_i = E[X_i]$ is the expected payoff of arm $i$.
The tricky part here is that when we pull an arm and get a bad return,
we don’t know if we were just unlucky this time or it’s actually a bad arm.
So we have an incentive to try lots of different arms. On the other hand, the
more we pull a genuinely inferior arm, the worse our overall return. We’d like
to adopt a strategy that trades off between exploration (trying new arms)
and exploitation (collecting on the best arm so far) to do as well as we can in comparison to a strategy that always pulls the best arm.
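The strategy analyzed in the rest of this section is what is usually called UCB1: always pull the arm maximizing the average observed return plus a bonus $c_{t,s} = \sqrt{2\ln t/s}$. Here is a minimal Python sketch (the Bernoulli arms and their success probabilities are hypothetical choices for illustration):

import random
from math import log, sqrt

def ucb1(arms, pulls: int, rng: random.Random) -> list[int]:
    # arms: success probabilities of Bernoulli arms. Pull the arm
    # maximizing (average return) + sqrt(2 ln t / s), where s is how
    # often that arm has been pulled among t total pulls.
    k = len(arms)
    counts, totals = [0] * k, [0.0] * k
    for t in range(1, pulls + 1):
        if t <= k:
            i = t - 1                  # pull each arm once to start
        else:
            i = max(range(k), key=lambda j:
                    totals[j] / counts[j] + sqrt(2 * log(t) / counts[j]))
        counts[i] += 1
        totals[i] += 1.0 if rng.random() < arms[i] else 0.0
    return counts

rng = random.Random(0)
print(ucb1([0.3, 0.5, 0.6], 10000, rng))  # most pulls go to the 0.6 arm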
Now consider $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}$ where each $X_{ij}$ lies between 0 and 1. This is equivalent to having $\bar{x}_i = \sum_{j=1}^{n_i} Y_j$ where $Y_j = X_{ij}/n_i$ lies between 0 and $1/n_i$. So (5.3.17) says that
$$\Pr\left[\bar{x}_i - E[\bar{x}_i] \ge \sqrt{\frac{2\ln n}{n_i}}\right] \le e^{-2\left(\sqrt{(2\ln n)/n_i}\right)^2/\left(n_i(1/n_i)^2\right)} = e^{-4\ln n} = n^{-4}. \tag{5.3.19}$$
the bonus given to an arm that has been pulled $s$ times in the first $t$ pulls. Fix some optimal arm. Let $\bar{X}_{i,s}$ be the average return on arm $i$ after $s$ pulls and $\bar{X}^*_s$ be the average return on the optimal arm after $s$ pulls.
If we pull arm $i$ after $t$ total pulls, when arm $i$ has previously been pulled $s_i$ times and our optimal arm has been pulled $s^*$ times, then we must have
$$\bar{X}_{i,s_i} + c_{t,s_i} \ge \bar{X}^*_{s^*} + c_{t,s^*}. \tag{5.3.22}$$
This just says that arm i with its bonus looks better than the optimal
arm with its bonus.
To show that this bad event doesn't happen, we need three things:
1. The value of $\bar{X}^*_{s^*} + c_{t,s^*}$ should be at least $\mu^*$. Were it to be smaller, the observed value $\bar{X}^*_{s^*}$ would be more than $c_{t,s^*}$ away from its expectation. Hoeffding's inequality implies this doesn't happen too often.
2. The value of $\bar{X}_{i,s_i} + c_{t,s_i}$ should not be too much bigger than $\mu_i$. We'll again use Hoeffding's inequality to show that $\bar{X}_{i,s_i}$ is likely to be at most $\mu_i + c_{t,s_i}$, making $\bar{X}_{i,s_i} + c_{t,s_i}$ at most $\mu_i + 2c_{t,s_i}$.
3. The bonus $c_{t,s_i}$ should be small enough that even adding $2c_{t,s_i}$ to $\mu_i$ is not enough to beat $\mu^*$. This means that we need to pick $s_i$ large enough that $2c_{t,s_i} \le \Delta_i$. For smaller values of $s_i$, we will just accept that we need to pull arm $i$ a few more times before we wise up.
More formally, if none of the following events hold:
$$\bar{X}^*_{s^*} + c_{t,s^*} \le \mu^*, \tag{5.3.23}$$
$$\bar{X}_{i,s_i} \ge \mu_i + c_{t,s_i}, \tag{5.3.24}$$
$$\mu^* - \mu_i < 2c_{t,s_i}, \tag{5.3.25}$$
then $\bar{X}^*_{s^*} + c_{t,s^*} > \mu^* > \mu_i + 2c_{t,s_i} > \bar{X}_{i,s_i} + c_{t,s_i}$, and we don't pull arm $i$ because the optimal arm is better. (We don't necessarily pull the optimal arm, but if we don't, it's because we pull some other arm that still isn't arm $i$.)
For (5.3.23) and (5.3.24), we repeat the argument in (5.3.19), plugging in $t$ for $n$ and $s_i$ or $s^*$ for $n_i$. This gives a probability of at most $2t^{-4}$ that either or both of these bad events occur.
For (5.3.25), we need to do something a little sneaky, because the statement is not actually true when $s_i$ is small. So we will give $\ell_i$ free pulls to arm $i$, and only start comparing arm $i$ to the optimal arm after we have done at least this many pulls. The value of $\ell_i$ is chosen so that, when $t \le n$ and $s_i > \ell_i$,
$$2c_{t,s_i} \le \mu^* - \mu_i,$$
which expands to
$$2\sqrt{\frac{2\ln t}{s_i}} \le \Delta_i,$$
giving
$$s_i \ge \frac{8\ln t}{\Delta_i^2}.$$
So we must set $\ell_i$ to be at least
$$\frac{8\ln n}{\Delta_i^2} \ge \frac{8\ln t}{\Delta_i^2}.$$
Because $\ell_i$ must be an integer, we actually get
$$\ell_i = \left\lceil\frac{8\ln n}{\Delta_i^2}\right\rceil \le 1 + \frac{8\ln n}{\Delta_i^2}.$$
This explains (after multiplying by the regret $\Delta_i$) the first term in (5.3.20).
For the other sources of bad pulls, apply the union bound to the $2t^{-4}$ error probabilities we previously computed for all choices of $t \le n$, $s^* \ge 1$, and $s_i > \ell_i$. This gives
$$\sum_{t=1}^{n}\sum_{s^*=1}^{t-1}\sum_{s_i=\ell_i+1}^{t-1} 2t^{-4} < 2\sum_{t=1}^{\infty} t^2\cdot t^{-4} = 2\cdot\frac{\pi^2}{6} = \frac{\pi^2}{3}.$$
Again we have to multiply by the regret $\Delta_i$ for pulling the $i$-th arm, which gives the second term in (5.3.20).
$$\Pr\left[\lim_{n\to\infty} S_n/n = \mu\right] = 1 \tag{5.4.1}$$
$$\lim_{n\to\infty}\Pr\left[\frac{S_n - \mu n}{\sigma\sqrt{n}} \le t\right] = \Phi(t), \tag{5.4.2}$$
and for any fixed constant $t$, we don't know when the limit behavior actually starts working.
But there are variants of these theorems that do bound the rate of convergence that can be useful in some cases. An example is given in §5.5.1.
where the $x_i$ are constants with $|x_i| \ge 1$ and the $\epsilon_i$ are independent $\pm 1$ fair coin-flips, then $\Pr\left[\left|\sum_{i=1}^n \epsilon_i x_i\right| \le r\right]$ is maximized by making all the $x_i$ equal to 1. This shows that any distribution where the $x_i$ are all reasonably large will not be any more concentrated than a binomial distribution.
There has been a lot of more recent work on variants of the Littlewood-
Offord problem, much of it by Terry Tao and Van Vu. See http://terrytao.
wordpress.com/2009/02/16/a-sharp-inverse-littlewood-offord-theorem/
for a summary of much of this work.
Chapter 6
Randomized search trees
These are data structures that are either trees or equivalent to trees, and use
randomization to maintain balance. We’ll start by reviewing deterministic
binary search trees and then add in the randomization.
[Figure: rotating the edge between $x$ and $y$: the tree rooted at $y$ with subtrees $(x(A, B), C)$ becomes the tree rooted at $x$ with subtrees $(A, y(B, C))$.]
[Figure: a balanced binary search tree on keys 1 through 7, next to a degenerate path on keys 1 through 3.]
randomized QuickSort (see §1.3.1), so the structure of the tree will exactly
mirror the structure of an execution of QuickSort. So, for example, we
can immediately observe from our previous analysis of QuickSort that the
total path length—the sum of the depths of the nodes—is Θ(n log n),
since the depth of each node is equal to 1 plus the number of comparisons it
participates in as a non-pivot, and (using the same argument as for Hoare’s
FIND in §3.6.4) that the height of the tree is O(log n) with high probability.2
When n is small, randomized binary search trees can look pretty scraggly.
Figure 6.3 shows a typical example.
The problem with this approach in general is that we don’t have any
guarantees that the input will be supplied in random order, and in the
worst case we end up with a linked list, giving O(n) worst-case cost for all
operations.
2 The argument for Hoare's FIND is that any node has at most 3/4 of the descendants of its parent on average; this gives for any node $x$ that $\Pr[\operatorname{depth}(x) > d] \le (3/4)^{d-1} n$, or a probability of at most $n^{-c}$ that $\operatorname{depth}(x) > 1 + (c+1)\log(n)/\log(4/3) \approx 1 + 6.952\ln n$ for $c = 1$, which we need to apply the union bound. The right answer for the actual height of a randomly-generated search tree in the limit is $\approx 4.31107\ln n$ [Dev88], so this bound is actually pretty close. The real height is still nearly a factor of three worse than for a completely balanced tree, which has max depth bounded by $1 + \lg n \approx 1 + 1.44269\ln n$.
[Figure 6.3: a typical randomized binary search tree on a small number of keys; with few elements the tree can be quite scraggly.]
6.3 Treaps
The solution to bad inputs is the same as for QuickSort: instead of assuming
that the input is permuted randomly, we assign random priorities to each
element and organize the tree so that elements with higher priorities rise
to the top. The resulting structure is known as a treap [SA96], because it
satisfies the binary search tree property with respect to keys and the heap
property with respect to priorities.3
There’s an extensive page of information on treaps at http://faculty.
washington.edu/aragon/treaps.html, maintained by Cecilia Aragon, the
co-inventor of treaps; they are also discussed at length in [MR95, §8.2]. We’ll
give a brief description here.
To insert a new node in a treap, first walk down the tree according to
the key and insert the node as a new leaf. Then go back up fixing the heap
property by rotating the new element up until it reaches an ancestor with
the same or higher priority. (See Figure 6.4 for an example.) Deletion is the
reverse of insertion: rotate a node until it has 0 or 1 children (by swapping
with its higher-priority child at each step), and then prune it out, connecting
its child, if any, directly to its parent.
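Here is a minimal Python sketch of treap insertion via rotations (priorities drawn as random floats, with larger values winning; a real implementation would also want search and delete):

import random

class Node:
    __slots__ = ("key", "priority", "left", "right")
    def __init__(self, key):
        self.key, self.priority = key, random.random()
        self.left = self.right = None

def insert(root, key):
    # Insert as in an ordinary BST, then rotate the new node up until
    # its parent has a higher priority.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.priority > root.priority:      # rotate right
            new_root, root.left = root.left, root.left.right
            new_root.right = root
            return new_root
    else:
        root.right = insert(root.right, key)
        if root.right.priority > root.priority:     # rotate left
            new_root, root.right = root.right, root.right.left
            new_root.left = root
            return new_root
    return root

root = None
for k in range(1, 8):
    root = insert(root, k)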
Because of the heap property, the root of each subtree is always the
element in that subtree with the highest priority. This means that the
structure of a treap is completely determined by the priorities and the keys,
no matter what order the elements arrive in. We can imagine in retrospect
3 The name “treap” for this data structure is now standard but the history is a little tricky. According to Seidel and Aragon, essentially the same data structure (though with non-random priorities) was previously called a cartesian tree by Vuillemin [Vui80], and the word “treap” was initially applied by McCreight to a different data structure—designed for storing two-dimensional data—that was called a priority search tree in its published form [McC85].
1,60 1,60
\ \
2,3 --> 3,26
\ /
(3,26) 2,3
5,78 5,78
/ \ / \
1,60 6,18 --> 1,60 7,41
\ \ \ /
3,26 (7,41) 3,26 6,18
/ \ / \
2,3 4,24 2,3 4,24
Figure 6.4: Inserting values into a treap. Each node is labeled with $k, p$ where $k$ is the key and $p$ the priority. Insertions of values not requiring rotations are not shown.
6.3.2 Analysis
The analysis of treaps as carried out by Seidel and Aragon [SA96] is a nice
example of how to decompose a messy process into simple variables, much
like the linearity-of-expectation argument for QuickSort (§1.3.2). The key
observation is that it’s possible to bound both the expected depth of any node
and the number of rotations needed for an insert or delete operation directly
from information about the ancestor-descendant relationship between nodes.
Define two classes of indicator variables. For simplicity, we assume that the elements have keys 1 through $n$, which we also use as indices. Let $A_{i,j}$ indicate the event that $i$ is an ancestor of $j$, and let $C_{i;\ell,m}$ indicate the event that $i$ is a common ancestor of both $\ell$ and $m$; here we adopt the convention that every node is an ancestor of itself.
The nice thing about these indicator variables is that it’s easy to compute
their expectations.
For Ai,j , i will be the ancestor of j if and only if i has a higher priority
than j and there is no k between i and j that has an even higher prior-
ity: in other words, if i has the highest priority of all keys in the interval
[min(i, j), max(i, j)]. To see this, imagine that we are constructing the treap
recursively, by starting with all elements in a single interval and partitioning
each interval by its highest-priority element. Consider the last interval in
this process that contains both i and j, and suppose i < j (the j > i case is
symmetric). If the highest-priority element is some k with i < k < j, then i
and j are separated into distinct intervals and neither is the ancestor of the
other. If the highest-priority element is j, then j becomes the ancestor of i.
The highest-priority element can’t be less than i or greater than j, because
then we get a smaller interval that contains both i and j. So the only case
where i becomes an ancestor of j is when i has the highest priority.
It follows that $E[A_{i,j}] = \frac{1}{|i-j|+1}$, where the denominator is just the number of elements in the range $[\min(i,j), \max(i,j)]$.
For $C_{i;\ell,m}$, $i$ is the common ancestor of both $\ell$ and $m$ if and only if it has the highest priority in both $[\min(i,\ell), \max(i,\ell)]$ and $[\min(i,m), \max(i,m)]$. It turns out that no matter what order $i$, $\ell$, and $m$ come in, these intervals overlap so that $i$ must have the highest priority in $[\min(i,\ell,m), \max(i,\ell,m)]$. This gives $E[C_{i;\ell,m}] = \frac{1}{\max(i,\ell,m)-\min(i,\ell,m)+1}$.
6.3.2.1 Searches
From the $A_{i,j}$ variables we can compute $\operatorname{depth}(j) = \sum_i A_{i,j} - 1$.4 So
$$\begin{aligned}
E[\operatorname{depth}(j)] &= \sum_{i=1}^n \frac{1}{|i-j|+1} - 1 \\
&= \sum_{i=1}^{j}\frac{1}{j-i+1} + \sum_{i=j+1}^{n}\frac{1}{i-j+1} - 1 \\
&= \sum_{k=1}^{j}\frac{1}{k} + \sum_{k=2}^{n-j+1}\frac{1}{k} - 1 \\
&= H_j + H_{n-j+1} - 2.\end{aligned}$$
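Continuing the treap sketch above, here is a quick empirical check of this formula (assuming the Node class and insert function from the earlier sketch):

import random

def depth(root, key, d=0):
    if root.key == key:
        return d
    return depth(root.left if key < root.key else root.right, key, d + 1)

def H(k):  # harmonic number H_k
    return sum(1.0 / i for i in range(1, k + 1))

n, trials, j = 31, 2000, 5
total = 0
for _ in range(trials):
    root = None
    for k in range(1, n + 1):  # order doesn't matter: priorities decide shape
        root = insert(root, k)
    total += depth(root, j)
print(f"average depth of {j}: {total / trials:.3f}, "
      f"H_j + H_(n-j+1) - 2 = {H(j) + H(n - j + 1) - 2:.3f}")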
Figure 6.5: Rotating 4 right shortens the right spine of its left subtree by
removing 2; rotating left shortens the left spine of the right subtree by
removing 6.
the subtree will be carried out from under the target without ever appearing
as a child or parent of the target. Because each rotation removes exactly one
element from one or the other of the two spines, and we finish when both
are empty, the sum of the length of the spines gives the number of rotations.
To calculate the length of the right spine of the left subtree of some element $\ell$, start with the predecessor $\ell - 1$ of $\ell$. Because there is no element between them, either $\ell - 1$ is a descendant of $\ell$ or an ancestor of $\ell$. In the former case (for example, when $\ell$ is 4 in Figure 6.5), we want to include all ancestors of $\ell - 1$ up to $\ell$ itself. Starting with $\sum_i A_{i,\ell-1}$ gets all the ancestors of $\ell-1$; subtracting the common ancestors of $\ell-1$ and $\ell$ removes $\ell$ and its ancestors, leaving just the spine. In the latter case, the same expression evaluates to zero.
It follows that the expected length of the right spine of the left subtree is exactly
$$\begin{aligned}
E\left[\sum_{i=1}^n A_{i,\ell-1} - \sum_{i=1}^n C_{i;\ell-1,\ell}\right]
&= \sum_{i=1}^n \frac{1}{|i-(\ell-1)|+1} - \sum_{i=1}^n \frac{1}{\max(i,\ell)-\min(i,\ell-1)+1} \\
&= \sum_{i=1}^{\ell-1}\frac{1}{\ell-i} + \sum_{i=\ell}^{n}\frac{1}{i-\ell+2} - \sum_{i=1}^{\ell-1}\frac{1}{\ell-i+1} - \sum_{i=\ell}^{n}\frac{1}{i-(\ell-1)+1} \\
&= \sum_{j=1}^{\ell-1}\frac{1}{j} + \sum_{j=2}^{n-\ell+2}\frac{1}{j} - \sum_{j=2}^{\ell}\frac{1}{j} - \sum_{j=2}^{n-\ell+2}\frac{1}{j} \\
&= 1 - \frac{1}{\ell}.\end{aligned}$$
By symmetry, the expected length of the left spine of the right subtree is $1 - \frac{1}{n-\ell+1}$. So the total expected number of rotations needed to delete the
[Figure 6.6 content: HEAD and TAIL sentinels; LEVEL 2: 33; LEVEL 1: 13 33 48; LEVEL 0: 13 21 33 48 75 99.]
Figure 6.6: A skip list. The blue search path for 99 is superimposed on an
original image from [AS07].
$\ell$-th element is
$$2 - \frac{1}{\ell} - \frac{1}{n-\ell+1} \le 2.$$
until it reaches the first element that is also in a higher level; it then jumps
to the next level up and repeats the process. The nice thing about this
reversed process is that it has a simple recursive structure: if we restrict a
skip list to only those nodes to the left of and at the same level or higher
of a particular node, we again get a skip list. Furthermore, the structure of
this restricted skip list depends only on coin-flips taken at nodes within it,
so it’s independent of anything that happens elsewhere in the full skip list.
We can analyze this process by tracking the number of nodes in the
restricted skip list described above, which is just the number of nodes in the
current level that are earlier than the current node. If we move left, this
drops by 1; if up, this drops to p times its previous value on average. So the
number of such nodes Xk after k steps satisfies the probabilistic recurrence
Skip lists are even easier to split or merge than treaps. It’s enough to
cut (or recreate) all the pointers crossing the boundary, without changing
the structure of the rest of the list.
Chapter 7
Hashing
The basic idea of hashing is that we have keys from a large set U , and we’d
like to pack them in a small set M by passing them through some function
h : U → M , without getting too many collisions, pairs of distinct keys x
and y with h(x) = h(y). Where randomization comes in is that we want this
to be true even if the adversary picks the keys to hash. At one extreme, we
could use a random function, but this will take a lot of space to store.1 So
our goal will be to find functions with succinct descriptions that are still
random enough to do what we want.
The presentation in this chapter is based largely on [MR95, §§8.4-8.5]
(which is in turn based on work of Carter and Wegman [CW77] on universal
hashing and Fredman, Komlós, and Szemerédi [FKS84] on O(1) worst-case
hashing); on [PR04] and [Pag06] for cuckoo hashing; and [MU17, §5.5.3] for
Bloom filters.
Proof. We'll count collisions in the inverse image of each element $z$. Since all distinct pairs of elements of $h^{-1}(z)$ collide with each other, we have
$$\delta(h^{-1}(z), h^{-1}(z), h) = \left|h^{-1}(z)\right|\cdot\left(\left|h^{-1}(z)\right| - 1\right).$$
Since $1 - \frac{m-1}{|U|-1}$ is likely to be very close to 1, we are happy if we get the 2-universal upper bound of $|H|/m$.
Why we care about this: With a 2-universal hash family, chaining using
linked lists costs O(1 + n/m) expected time per operation. The reason is
that the expected cost of an operation on some key x is proportional to the
size of the linked list at h(x) (plus O(1) for the cost of hashing itself). But
the expected size of this linked list is just the expected number of keys y in
the dictionary that collide with x, which is exactly δ(x, S, H)/|H| ≤ n/m.
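As an illustration, here is a sketch of the classic Carter-Wegman construction of a 2-universal family, $h_{a,b}(x) = ((ax+b) \bmod p) \bmod m$, used with chaining; the particular prime and table size are arbitrary choices:

import random

P = (1 << 61) - 1   # a Mersenne prime comfortably larger than our keys

def random_hash(m: int):
    # h(x) = ((a*x + b) mod p) mod m is 2-universal when a, b are
    # chosen at random with a != 0.
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

m = 1024
h = random_hash(m)
table = [[] for _ in range(m)]   # chaining: expected O(1 + n/m) per operation
for key in range(5000):
    table[h(key)].append(key)
print(max(len(bucket) for bucket in table))   # longest chain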
value; it's enough if we can order the strings so that each string gets a value that isn't represented among its predecessors.
More formally, suppose we can order the strings $x^1, x^2, \dots, x^n$ that we are hashing so that each has a position $i_j$ such that $x^j_{i_j} \ne x^{j'}_{i_j}$ for any $j' < j$; then we have, for each value $v$, $\Pr[h(x^j) = v \mid h(x^{j'}) = v_{j'}, \forall j' < j] = 1/m$. It follows that the hash values are independent:
$$\begin{aligned}
\Pr\left[h(x^1) = v_1, h(x^2) = v_2, \dots, h(x^n) = v_n\right]
&= \prod_{j=1}^{n}\Pr\left[h(x^j) = v_j \mid h(x^1) = v_1, \dots, h(x^{j-1}) = v_{j-1}\right] \\
&= \frac{1}{m^n} \\
&= \prod_{j=1}^{n}\Pr\left[h(x^j) = v_j\right].\end{aligned}$$
Now we want to show that when $n = 3$, this actually works for all possible distinct strings $x$, $y$, and $z$. Let $S$ be the set of indices $i$ such that $y_i \ne x_i$, and similarly let $T$ be the set of indices $i$ such that $z_i \ne x_i$; note that both sets must be non-empty, since $y \ne x$ and $z \ne x$. If $S \setminus T$ is nonempty, then (a) there is some index $i$ in $T$ where $z_i \ne x_i$, and (b) there is some index $j$ in $S \setminus T$ where $y_j \ne x_j = z_j$; in this case, ordering the strings as $x$, $z$, $y$ gives the independence property above. If $T \setminus S$ is nonempty, order them as $x$, $y$, $z$ instead. Alternatively, if $S = T$, then $y_i \ne z_i$ for some $i$ in $S$ (otherwise $y = z$, since they both equal $x$ on all positions outside $S$). In this case, $x_i$, $y_i$, and $z_i$ are all distinct.
For n = 4, we can have strings aa, ab, ba, and bb. If we take the
bitwise exclusive OR of all four hash values, we get zero, because each
character is included exactly twice in each position. So the hash values are
not independent, and we do not get 4-independence in general.
However, even though the outputs of tabulation hashing are not 4-
independent, most reasonably small sets of inputs do give independence.
This can be used to show various miraculous properties like working well for
the cuckoo hashing algorithm described in §7.4.
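Here is a minimal sketch of tabulation hashing, including a check of the XOR identity just described; the byte-string keys and table sizes are our own choices for illustration:

import random

def make_tabulation_hash(positions: int, bits: int = 32):
    # One table of random values per character position; hash by XOR.
    tables = [[random.getrandbits(bits) for _ in range(256)]
              for _ in range(positions)]
    def h(key: bytes) -> int:
        v = 0
        for i, c in enumerate(key):
            v ^= tables[i][c]
        return v
    return h

h = make_tabulation_hash(4)
print(hex(h(b"abcd")), hex(h(b"abce")))
# As in the text, h(aa) ^ h(ab) ^ h(ba) ^ h(bb) = 0, since each table
# entry is XORed in exactly twice: tabulation hashing is not 4-independent.
h2 = make_tabulation_hash(2)
print(h2(b"aa") ^ h2(b"ab") ^ h2(b"ba") ^ h2(b"bb"))   # always 0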
not consuming too much space. The assumption that S is static is critical,
because FKS chooses hash functions based on the elements of S.
If we were lucky in our choice of S, we might be able to do this with
standard hashing. A perfect hash function for a set $S \subseteq U$ is a hash function $h: U \to M$ that is injective on $S$ (that is, $x \ne y$ implies $h(x) \ne h(y)$ when $x, y \in S$). Unfortunately, we can only count on finding a perfect hash function if $m$ is large:
Lemma 7.3.1. If $H$ is 2-universal and $|S| = n$ with $\binom{n}{2} \le m$, then there is a perfect $h \in H$ for $S$.
Proof. Let $h$ be chosen uniformly at random from $H$. For each unordered pair $x \ne y$ in $S$, let $X_{xy}$ be the indicator variable for the event that $h(x) = h(y)$, and let $C = \sum_{x\ne y} X_{xy}$ be the total number of collisions in $S$. Each $X_{xy}$ has expectation at most $1/m$, so $E[C] \le \binom{n}{2}/m < 1$. But we can write $E[C]$ as $E[C \mid C = 0]\Pr[C = 0] + E[C \mid C \ge 1]\Pr[C \ne 0] \ge \Pr[C \ne 0]$. So $\Pr[C \ne 0] \le \binom{n}{2}/m < 1$, giving $\Pr[C = 0] > 0$. But if $C$ is zero with nonzero
The time to do a search is O(1) in the worst case: O(1) for the outer hash
plus O(1) for the inner hash.
The last equality holds because each ordered pair of distinct values in $S$ that map to the same bucket $i$ corresponds to exactly one collision in $\delta(S, S, h)$.
Since $H$ is 2-universal, we have $\delta(S, S, H) \le |H|\frac{|S|(|S|-1)}{n} = |H|\frac{n(n-1)}{n} = |H|(n-1)$. But then the Pigeonhole principle says there exists some $h \in H$ with $\delta(S, S, h) \le \frac{1}{|H|}\delta(S, S, H) = n - 1$. Choosing this $h$ gives $\sum_{i=1}^n n_i^2 \le n + (n-1) = 2n - 1 = O(n)$.
The main difference is that it uses just one table instead of the two tables—one
for each hash function—in [PR04].
7.4.1 Structure
We have a table T of size n, with two separate, independent hash functions h1
and h2 . These functions are assumed to be k-universal for some sufficiently
large value k; as long as we never look at more than k values at once, this
means we can treat them effectively as random functions. In practice, using
crummy hash functions seems to work just fine, a common property of hash
tables. There are also specific hash functions that have been shown to work
with particular variants of cuckoo hashing [PR04, PT12]. We will avoid these
issues by assuming that our hash functions are actually random.
Every key x is stored either in T [h1 (x)] or T [h2 (x)]. So the search
procedure just looks at both of these locations and returns whichever one
contains x (or fails if neither contains x).
To insert a value x1 = x, we must put it in T [h1 (x1 )] or T [h2 (x1 )]. If one
or both of these locations is empty, we put it there. Otherwise we have to
kick out some value that is in the way (this is the “cuckoo” part of cuckoo
hashing, named after the bird that leaves its eggs in other birds’ nests). We
do this by letting x2 = T [h1 (x1 )] and writing x1 to T [h1 (x1 )]. We now have
a new “nestless” value x2 , which we swap with whatever is in T [h2 (x2 )]. If
that location was empty, we are done; otherwise, we get a new value x3 that
we have to put in T [h1 (x3 )] and so on. The procedure terminates when we
find an empty spot or if enough iterations have passed that we don’t expect
to find an empty spot, in which case we rehash the entire table. This process
can be implemented succinctly as shown in Algorithm 7.1.
A detail not included in the above code is that we always rehash (in
theory) after m2 insertions; this avoids potential problems with the hash
functions used in the paper not being universal enough. We will avoid this
issue by assuming that our hash functions are actually random (instead of
being approximately n-universal with reasonably high probability). For a
more principled analysis of where the hash functions come from, see [PR04].
An alternative hash family that is known to work for a slightly different
variant of cuckoo hashing is tabulation hashing, as described in §7.2.2; the
proof that this works is found in [PT12].
procedure insert(x)
    if T[h1(x)] = x or T[h2(x)] = x then
        return
    pos ← h1(x)
    for i ← 1 . . . n do
        if T[pos] = ⊥ then
            T[pos] ← x
            return
        x ↔ T[pos]    // swap x with whatever value is in its way
        if pos = h1(x) then
            pos ← h2(x)
        else
            pos ← h1(x)
    // if the loop runs out, rehash the entire table and start over

Algorithm 7.1: Insertion into a cuckoo hash table.
7.4.2 Analysis
The main question is how long it takes the insertion procedure to terminate,
assuming the table is not too full.
First let’s look at what happens during an insert if we have many nestless
values. We have a sequence of values x1 , x2 , . . . , where each pair of values
xi , xi+1 collides in h1 or h2 . Assuming we don’t reach the loop limit, there
are three main possibilities (the leaves of the tree of cases below):
2. Eventually we see the same key twice; there is some i and j > i such
that xj = xi . Since xi was already moved once, when we reach it
the second time we will try to move it back, displacing xi−1 . This
process continues until we have restored x2 to T [h1 (x1 )], displacing x1
to T [h2 (x1 )] and possibly creating a new sequence of nestless values.
Two outcomes are now possible:
Let’s look at the probability that we get the last case, a closed loop.
Following the argument of Pagh and Rodler, we let v be the number of
distinct nestless keys in the loop. Since v includes x1 and at least one other
element blocking x1 from being inserted at T [h1 (x1 )], v is at least 2. We can
now count how many different ways such a loop can form, and argue that in
each case we include enough information to reconstruct h1 (ui ) and h2 (ui )
for each of a specific set of unique elements u1 , . . . uv .
Formally, this means that we are expressing the closed-loop case as a union of many specific closed loops, and then bounding the probability of each of these specific closed-loop events by the probability of the event that $h_1$ and $h_2$ select the right values to make this particular closed loop possible. Then we apply the union bound.
To describe each of the specific events, we’ll provide this information:
$h_1(u_i)$ or $h_2(u_i)$ from the value of $u_i$ and its first location, and the other hash value for $u_i$ given the next location in the list.4
The series converges if $2n/m < 1$, so for any fixed $\alpha < 1/2$, the probability of any closed loop forming is $O(m^{-2})$.
If we do hit a closed loop, then we pay $O(m)$ time to scan the existing table and create a new empty table, and $O(n) = O(m)$ time on average to reinsert all the elements into the new table, assuming that this reinsertion process doesn't generate any more closed loops and that the average cost of an insertion that doesn't produce a closed loop is $O(1)$, which we will show below. But the rehashing step only fails with probability $O(nm^{-2}) = O(m^{-1})$, so if it does fail we can just try again until it works, and the expected total cost is still $O(m)$. Since we pay this $O(m)$ for each insertion with probability $O(m^{-2})$, this adds only $O(m^{-1})$ to the expected cost of a single insertion.
Now we look at what happens if we don’t get a closed loop. This doesn’t
force us to rehash, but if the path is long enough, we may still pay a lot to
do an insertion.
It's a little messy to analyze the behavior of keys that appear more than once in the sequence, so the trick used in the paper is to observe that for any sequence of nestless keys $x_1 \dots x_p$, there is a subsequence of size $p/3$ with no repetitions that starts with $x_1$. This will be either the sequence $S_1$ given by $x_1 \dots x_{j-1}$—the sequence starting with the first place we try to insert $x_1$—or $S_2$ given by $x_1 = x_{i+j-1} \dots x_p$, the sequence starting with the second place we try to insert $x_1$. Between these we have a third sequence $R$ where we
4 The original analysis in [PR04] avoids this by alternating between two tables, so that we can determine which of $h_1$ or $h_2$ is used at each step by parity.
revert some of the moves made in S1 . Because |S1 | + |R| + |S2 | ≥ p, at least
one of these three subsequences has size p/3. But |R| ≤ |S1 |, so it must be
either S1 or S2 .
We can then argue that the probability that we get a sequence of $v$ distinct keys in either $S_1$ or $S_2$ is at most $2(n/m)^{v-1}$. The $(n/m)^{v-1}$ is because we need to hit a nonempty spot (which happens with probability at most $n/m$) for the first $v-1$ elements in the path, and since we assume that our hash functions are random, the choices of these $v-1$ spots are all independent. The 2 is from the union bound over $S_1$ and $S_2$. If $T$ is the length of the longer of $S_1$ or $S_2$, we get $E[T] = \sum_{v=1}^{\infty}\Pr[T \ge v] \le \sum_{v=1}^{\infty} 2(n/m)^{v-1} = O(1)$,
assuming n/m is bounded by a constant less than 1. Since we already need
n/m ≤ 1/2 to avoid the bad closed-loop case, we can use this here as well.
We have to multiply E [T ] by 3 to get the bound on the actual path, but this
disappears into O(1).
An annoyance with cuckoo hashing is that it has high space overhead compared to more traditional hash tables: in order for the first part of the analysis above to work, the table must be at least half empty. This can be avoided at the cost of increasing the time complexity by choosing between $d$ locations instead of 2. This technique, due to Fotakis et al. [FPSS03], is known as d-ary cuckoo hashing. For a suitable choice of $d$, it uses $(1+\epsilon)n$ space and guarantees that a lookup takes $O(1/\epsilon)$ probes, while insertion takes $(1/\epsilon)^{O(\log\log(1/\epsilon))}$ steps in theory and appears to take $O(1/\epsilon)$ steps in practice according to experiments done by the authors.
7.6.1 Construction
Bloom filters are a highly space-efficient randomized data structure invented
by Burton H. Bloom [Blo70] for storing sets of keys, with a small probability
for each key not in the set that it will be erroneously reported as being in
the set.
Suppose we have k independent hash functions h1 , h2 , . . . , hk . Our mem-
ory store A is a vector of m bits, all initially zero. To store a key x, set
A[hi (x)] = 1 for all i. To test membership for x, see if A[hi (x)] = 1 for all
i. The membership test always gives the right answer if x is in fact in the
Bloom filter. If not, we might decide that x is in the Bloom filter anyway,
just because we got lucky.
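Here is a minimal Bloom filter sketch; for illustration the $k$ hash functions are simulated with salted calls to Python's built-in hash, standing in for genuinely independent hash functions:

import random

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)
        self.salts = [random.getrandbits(64) for _ in range(k)]
    def _positions(self, x):
        # salted built-in hashes stand in for k independent hash functions
        return [hash((s, x)) % self.m for s in self.salts]
    def add(self, x):
        for i in self._positions(x):
            self.bits[i] = 1
    def __contains__(self, x):
        return all(self.bits[i] for i in self._positions(x))

bf = BloomFilter(m=10000, k=7)
for w in ("apple", "banana", "cherry"):
    bf.add(w)
print("banana" in bf, "durian" in bf)   # True, almost certainly False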
the kn bits that we set while inserting the n values has one chance in m of
hitting position i.
We'd like to simplify this using the inequality $1 + x \le e^x$, but it goes in the wrong direction; instead, we'll use $1 - x \ge e^{-x-x^2}$, which holds for $0 \le x \le 0.683803$ and in our application holds for $m \ge 2$. This gives
$$\begin{aligned}
\Pr[A[i] = 1] &\le 1 - (1 - 1/m)^{kn} \\
&\le 1 - e^{-kn(1/m)(1+1/m)} \\
&= 1 - e^{-k\alpha(1+1/m)} \\
&= 1 - e^{-k\alpha'}\end{aligned}$$
where $\alpha = n/m$ is the load factor and $\alpha' = \alpha(1 + 1/m)$ is the load factor fudged upward by a factor of $1 + 1/m$ to make the inequality work.
Suppose now that we check to see if some value $x$ that we never inserted in the Bloom filter appears to be present anyway. This occurs if $A[h_i(x)] = 1$ for all $i$. Since each $A[h_i(x)]$ is an independent sample from $A$, the probability that they all come up 1 conditioned on $A$ is
$$\left(\frac{\sum A[i]}{m}\right)^{k}. \tag{7.6.1}$$
We have an upper bound $E\left[\sum A[i]\right] \le m\left(1 - e^{-k\alpha'}\right)$, and if we were born
doing the usual trick of taking a derivative and setting it to zero; to avoid
weirdness with the k in the exponent, it helps to take the logarithm first
(which doesn’t affect the location of the minimum), and it further helps to
take the derivative with respect to $x = e^{-\alpha' k}$ instead of $k$ itself. Note that when we do this, $k = -\frac{1}{\alpha'}\ln x$ still depends on $x$, and we will deal with this
by applying this substitution at an appropriate point.
Compute
$$\begin{aligned}
\frac{d}{dx}\ln\left((1-x)^k\right) &= \frac{d}{dx}\left[k\ln(1-x)\right] \\
&= \frac{d}{dx}\left[-\frac{1}{\alpha'}\ln x\ln(1-x)\right] \\
&= -\frac{1}{\alpha'}\left(\frac{\ln(1-x)}{x} - \frac{\ln x}{1-x}\right).\end{aligned}$$
the false positive rate)6 and for any set $A \subseteq U$ of size $n$, $A \subseteq S_i$ for at least one $S_i$ (allowing us to store $A$).
Let $N = |U|$. Then each set $S_i$ covers $\binom{\epsilon N}{n}$ of the $\binom{N}{n}$ subsets of size $n$. If we could get them to overlap optimally (we can't), we'd still need a minimum of $\binom{N}{n}\big/\binom{\epsilon N}{n} = (N)_n/(\epsilon N)_n \approx (1/\epsilon)^n$ sets to cover everybody, where the approximation assumes $N \gg n$. Taking the log gives $\lg M \approx n\lg(1/\epsilon)$, meaning we need about $\lg(1/\epsilon)$ bits per key for the data structure. Bloom filters use $1/\ln 2$ times this.
There are known data structures that approach this bound asymptotically.
The first of these, due to Pagh et al. [PPR05] also has other desirable
properties, like supporting deletions and faster lookups if we can’t look up
bits in parallel.
More recently, Fan et al. [FAKM14] have described a variant of cuckoo
hashing (see §7.4) called a cuckoo filter. This is a cuckoo hash table that,
instead of storing full keys $x$, stores fingerprints $f(x)$, where $f$ is a hash function with $\ell$-bit outputs. False positives now arise if we happen to hash a
value $x'$ with $f(x') = f(x)$ to the same location as $x$. If $f$ is drawn from a 2-universal family, this occurs with probability at most $2^{-\ell}$. So the idea is that by accepting a small rate $\epsilon$ of false positives, we can shrink the space needed to store each key from the full key length to $\lg(1/\epsilon) = \ln(1/\epsilon)/\ln 2$, the asymptotic minimum.
One complication is that, since we are throwing away the original key x,
when we displace a key from h1 (x) to h2 (x) or vice versa, we can’t recompute
h1 (x) and h2 (x) for arbitrary h1 and h2 . The solution proposed by Fan et al.
is to let h2 (x) = h1 (x) ⊕ g(f (x)), where g is a hash function that depends
only on the fingerprint. This means that when looking at a fingerprint f (x)
stored in position i, we don’t need to know whether i is h1 (x) or h2 (x), since
whichever it is, the other location will be i ⊕ g(f (x)). Unfortunately, this
technique and some other techniques used in the paper to crunch out excess
empty space break the standard analysis of cuckoo hashing, so the authors
can only point to experimental evidence that their data structure actually
6 Technically, this gives a weaker bound on false positives. For standard Bloom filters, assuming random hash functions, each key individually has at most an $\epsilon$ probability of appearing as a false positive. The hypothetical data structure we are considering here—which is effectively deterministic—allows the set of false positives to depend directly on the set of keys actually inserted in the data structure, meaning that the adversary could arrange for a specific key to appear as a false positive with probability 1 by choosing appropriate keys to insert. So this argument may underestimate the space needed to make the false positives less predictable. On the other hand, we aren't charging the Bloom filter for the space needed to store the hash functions, which could be quite a bit if they are genuine random functions.
works. However, a variant of this data structure has been shown to work by
Eppstein [Epp16].
7.6.4 Applications
Historically, Bloom filters were invented to act as a way of filtering queries
to a database table through fast but expensive7 RAM before looking up the
actual values on a slow but cheap tape drive. Nowadays the cost of RAM is
low enough that this is less of an issue in most cases, but Bloom filters are
still popular in networking and in distributed databases.
In networking, Bloom filters are useful in building network switches, where
incoming packets need to be matched against routing tables in fractions of a
nanosecond. Bloom filters work particularly well for this when implemented
in hardware, since the k hash functions can be computed in parallel. False
positives, if infrequent enough, can be handled by some slower backup
mechanism.
In distributed databases, Bloom filters are used in the Bloomjoin algo-
rithm [ML86]. Here we want to do a join on two tables stored on different
machines (a join is an operation where we find all pairs of rows, one in each
table, that match on some common key). A straightforward but expensive
way to do this is to send the list of keys from the smaller table across the
network, then match them against the corresponding keys from the larger
table. If there are $n_s$ rows in the smaller table, $n_b$ rows in the larger table, and $j$ matching rows in the larger table, this requires sending $n_s$ keys plus $j$ rows. If instead we send a Bloom filter representing the set of keys in the smaller table, we only need to send $n_s\lg(1/\epsilon)/\ln 2$ bits for the Bloom filter plus an extra $\epsilon n_b$ rows on average for the false positives. This can be cheaper
than sending full keys across if the number of false positives is reasonably
small.
once it reaches this value, further increments have no effect. The resulting
structure is called a counting Bloom filter, due to Fan et al. [FCAB00].
We can only expect this to work if our chance of hitting the cap is small.
Fan et al. observe that the probability that the m table entries include one
that is at least c after n insertions is bounded by
$$m\binom{nk}{c}\frac{1}{m^c} \le m\left(\frac{enk}{c}\right)^c\frac{1}{m^c} = m\left(\frac{enk}{cm}\right)^c = m(ek\alpha/c)^c.$$
(This uses the bound $\binom{n}{k} \le \left(\frac{en}{k}\right)^k$, which follows from Stirling's formula.)
For $k = \frac{1}{\alpha}\ln 2$, this is $m(e\ln 2/c)^c$. For the specific value of $c = 16$ (corresponding to 4 bits per entry), they compute a bound of $1.37\times 10^{-15}\cdot m$, which they argue is minuscule for all reasonable values of $m$ (it's a systems paper).
The possibility that a long chain of alternating insertions and deletions
might produce a false negative due to overflow is considered in the paper,
but the authors state that “the probability of such a chain of events is so
low that it is much more likely that the proxy server would be rebooted in
the meantime and the entire structure reconstructed.” An alternative way of
dealing with this problem is to never decrement a maxed-out register. This
never produces a false negative, but may cause the filter to slowly fill up
with maxed-out registers, producing a higher false-positive rate.
A fancier variant of this idea is the spectral Bloom filter of Cohen
and Matias [CM03], which uses larger counters to track multiplicities of
items. The essential idea here is that we can guess that the number of times
a particular value x was inserted is equal to minki=1 A[hi (x)]), with some
extra tinkering to detect errors based on deviations from the typical joint
distribution of the A[hi (x)] values. An even more sophisticated approach
gives the count-min sketches of the next section.
of data sets that are too large to store at all (network traffic statistics), or
too large to store in fast memory (very large database tables). By building
an appropriate small data structure using a single pass through the data,
we can still answer queries about the data with some loss of accuracy. Examples we
will consider include estimating the size of a set presented over time with
possible duplicate elements (§7.7.1) or more general statistical queries based
on aggregate counts of some sort (§7.7.2).
In each of these cases, the answers we get will be approximate. We will measure the quality of the approximation in terms of parameters $(\delta, \epsilon)$, where we demand a relative error of at most $\epsilon$ with probability at least $1 - \delta$. We'd also like our data structure to have size at most polylogarithmic in the number of samples $n$ and polynomial in $\delta$ and $\epsilon$.
$$\frac{1}{\frac{1}{m}\sum_i \frac{1}{\hat{n}_i}},$$
and thus
$$\frac{1}{p_s} = \frac{1}{\frac{1}{m}\sum_i 2^{-r_i-1}},$$
which looks suspiciously like the harmonic mean used on the final estimates in HyperLogLog. As with the original HyperLogLog, it is possible to show that the typical relative error for this sketch is $O(1/\sqrt{m})$. See [PWY20] for more details and some further improvements.
If we don’t care about practical engineering issues, there is a known asymp-
totically optimal solution to the cardinality estimation problem [KNW10],
which doesn’t even require assuming a random oracle, but the constants give
worse performance than the systems that people actually use.
used for more complex tasks like finding heavy hitters—indices with high
weight. The easiest case is approximating ai when all the ct are non-negative,
so we’ll start with that.
7.7.2.2 Queries
Let's start with point queries. Here we want to estimate $a_i$ for some fixed $i$. There are two cases; the first handles non-negative increments only, while the second handles arbitrary increments. In both cases we will get an estimate whose error is linear in both the error parameter $\epsilon$ and the $\ell_1$-norm $\|a\|_1 = \sum_i |a_i|$ of $a$. It follows that the relative error will be low for heavy points, but we may get a large relative error for light points (and especially large for points that don't appear in the data set at all).
For the non-negative case, to estimate $a_i$, compute $\hat{a}_i = \min_j c[j, h_j(i)]$. (This is the min part of count-min.) Then:
Lemma 7.7.1. When all $c_t$ are non-negative, for $\hat{a}_i$ as defined above:
$$\hat{a}_i \ge a_i, \tag{7.7.1}$$
and
$$\Pr\left[\hat{a}_i \le a_i + \epsilon\|a\|_1\right] \ge 1 - \delta. \tag{7.7.2}$$
Proof. The lower bound is easy. Since for each pair $(i, c_t)$ we increment each $c[j, h_j(i)]$ by $c_t$, we have an invariant that $a_i \le c[j, h_j(i)]$ for all $j$ throughout the computation, which gives $a_i \le \hat{a}_i = \min_j c[j, h_j(i)]$.
For the upper bound, let $I_{ijk}$ be the indicator for the event that $(i \ne k) \wedge (h_j(i) = h_j(k))$, i.e., that we get a collision between $i$ and $k$ using $h_j$. The 2-universality property of the $h_j$ gives $E[I_{ijk}] \le 1/w \le \epsilon/e$.
Now let $X_{ij} = \sum_{k=1}^n I_{ijk} a_k$. Then $c[j, h_j(i)] = a_i + X_{ij}$. (The fact that $X_{ij} \ge 0$ gives an alternate proof of the lower bound.) Now use linearity of
expectation to get
$$\begin{aligned}
E[X_{ij}] &= E\left[\sum_{k=1}^n I_{ijk} a_k\right] \\
&= \sum_{k=1}^n a_k E[I_{ijk}] \\
&\le \sum_{k=1}^n a_k(\epsilon/e) \\
&= (\epsilon/e)\|a\|_1.\end{aligned}$$
So $\Pr[c[j, h_j(i)] > a_i + \epsilon\|a\|_1] = \Pr[X_{ij} > eE[X_{ij}]] < 1/e$, by Markov's inequality. With $d$ choices for $j$, and each $h_j$ chosen independently, the probability that every count is too big is at most $(1/e)^d = e^{-d} \le \exp(-\ln(1/\delta)) = \delta$.
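Here is a minimal count-min sketch in Python implementing the update and the min-based point query for non-negative increments; row hashes are simulated with salted built-in hashes, and the parameters $w \approx e/\epsilon$, $d \approx \ln(1/\delta)$ follow the analysis above:

import random
from math import ceil, e, log

class CountMinSketch:
    def __init__(self, eps: float, delta: float):
        self.w = ceil(e / eps)            # so that 1/w <= eps/e
        self.d = ceil(log(1 / delta))     # so that e^-d <= delta
        self.c = [[0] * self.w for _ in range(self.d)]
        self.salts = [random.getrandbits(64) for _ in range(self.d)]
    def update(self, i, ct=1):            # increment (i_t, c_t)
        for j, s in enumerate(self.salts):
            self.c[j][hash((s, i)) % self.w] += ct
    def point_query(self, i):             # non-negative case: take the min
        return min(self.c[j][hash((s, i)) % self.w]
                   for j, s in enumerate(self.salts))

cms = CountMinSketch(eps=0.01, delta=0.01)
for _ in range(1000):
    cms.update("heavy")
cms.update("light")
print(cms.point_query("heavy"), cms.point_query("light"))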
Now let's consider the general case, where the increments $c_t$ might be negative. We still initialize and update the data structure as described in §7.7.2.1, but now when computing $\hat{a}_i$, we use the median count instead of the minimum count: $\hat{a}_i = \operatorname{median}\{c[j, h_j(i)] \mid j = 1\dots d\}$. Now we get:
Lemma 7.7.2. For $\hat{a}_i$ as defined above,
$$\Pr\left[|\hat{a}_i - a_i| > 3\epsilon\|a\|_1\right] \le \delta^{1/4}. \tag{7.7.3}$$
Proof. The basic idea is that for the median to be off by $t$, at least $d/2$ rows must give values that are off by $t$. We'll show that for $t = 3\epsilon\|a\|_1$, the expected number of rows that are off by $t$ is at most $d/8$. Since the hash functions for the rows are chosen independently, we can use Chernoff bounds to show that with a mean of $d/8$, the chances of getting all the way to $d/2$ are small.
In detail, we again define the error term $X_{ij}$ as above, and observe that
$$\begin{aligned}
E[|X_{ij}|] &= E\left[\left|\sum_k I_{ijk} a_k\right|\right] \\
&\le \sum_{k=1}^n |a_k|\,E[I_{ijk}] \\
&\le \sum_{k=1}^n |a_k|(\epsilon/e) \\
&= (\epsilon/e)\|a\|_1.\end{aligned}$$
Using Markov's inequality, we get $\Pr[|X_{ij}| > 3\epsilon\|a\|_1] = \Pr[|X_{ij}| > 3eE[|X_{ij}|]] < 1/(3e) < 1/8$. In order for the median to be off by more than $3\epsilon\|a\|_1$, we need $d/2$ of these low-probability events to occur. The expected number that occur is $\mu = d/8$, so applying the standard Chernoff bound (5.2.1) with $\delta = 3$ we are looking at
$$\begin{aligned}
\Pr[S \ge d/2] &= \Pr[S \ge (1+3)\mu] \\
&\le \left(e^3/4^4\right)^{d/8} \\
&\le \left(e^{3/8}/2\right)^{\ln(1/\delta)} \\
&= \delta^{\ln 2 - 3/8} \\
&< \delta^{1/4}.\end{aligned}$$
(The actual exponent is about 0.31, but 1/4 is easier to deal with.) This immediately gives (7.7.3).
One way to think about this is that getting an estimate within $\epsilon\|a\|_1$ of the right value with probability at least $1-\delta$ requires 3 times the width and 4 times the depth—or 12 times the space and 4 times the time—when we aren't assuming increments are non-negative.
Next, we consider inner products. Here we want to estimate $a\cdot b$, where $a$ and $b$ are both stored as count-min sketches using the same hash functions. The paper concentrates on the case where $a$ and $b$ are both non-negative, which has applications in estimating the size of a join in a database. The method is to estimate $a\cdot b$ as $\min_j \sum_{k=1}^{w} c_a[j,k]\cdot c_b[j,k]$.
For a single $j$, the sum consists of both good values and bad collisions; we have $\sum_{k=1}^{w} c_a[j,k]\cdot c_b[j,k] = \sum_{i=1}^{n} a_i b_i + \sum_{p\ne q,\, h_j(p)=h_j(q)} a_p b_q$. The second term has expectation
$$\sum_{p\ne q}\Pr[h_j(p) = h_j(q)]\,a_p b_q \le \sum_{p\ne q}(\epsilon/e)\,a_p b_q \le \sum_{p,q}(\epsilon/e)\,a_p b_q \le (\epsilon/e)\|a\|_1\|b\|_1.$$
be at most $1/\phi$ heavy hitters. But the tricky part is figuring out which elements they are.
The output at any stage will be approximate in the following sense: it is guaranteed that any $i$ such that $a_i \ge \phi\|a\|_1$ is included, and each $i$ with $a_i < (\phi-\epsilon)\|a\|_1$ that previously appeared in the stream is included with probability at most $1 - \delta$.
ran a point query on all possible i, but (a) there are many possible i and (b)
we won’t ever output an i we’ve never seen.
The trick is to extend the data structure and update procedure to track all the heavy elements found so far (stored in a heap, with the minimum estimate at the top), as well as $\|a\|_1 = \sum c_t$. When a new increment $(i, c)$ comes in, we first update the count-min structure and then do a point query on $a_i$; if $\hat{a}_i \ge \phi\|a\|_1$, we insert $i$ into the heap. We also delete any elements at the top of the heap that have a point-query estimate below threshold.
Because âi ≥ ai , every heavy hitter is correctly identified. However, it’s
possible that an index stops being a heavy hitter at some point (because the
threshold φkak1 rose since we included it). In this case it may get removed
from the heap, but if it becomes a heavy hitter again, we’ll put it back.
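Here is one way the heap-based tracker might look in Python, on top of the
CountMinSketch sketched earlier. The eviction policy for stale heap entries
is a detail the description above leaves open, so what follows is just one
plausible choice, not the paper’s exact procedure.

    import heapq

    class HeavyHitters:
        """Track items i with a_i >= phi * ||a||_1 (non-negative increments)."""

        def __init__(self, phi, eps, delta):
            self.sketch = CountMinSketch(eps, delta)
            self.phi = phi
            self.total = 0      # running ||a||_1 = sum of all increments
            self.heap = []      # (estimate at insertion time, item); may go stale
            self.members = set()

        def update(self, i, c=1):
            self.sketch.update(i, c)
            self.total += c
            est = self.sketch.query_min(i)
            if est >= self.phi * self.total and i not in self.members:
                heapq.heappush(self.heap, (est, i))
                self.members.add(i)
            # Delete items at the top whose *current* point-query estimate
            # has fallen below the (rising) threshold phi * ||a||_1.
            while self.heap:
                _, i0 = self.heap[0]
                if self.sketch.query_min(i0) >= self.phi * self.total:
                    break
                heapq.heappop(self.heap)
                self.members.discard(i0)

        def heavy_hitters(self):
            return sorted(self.members)

If an evicted index becomes heavy again, a later update reinserts it, matching
the behavior described above.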
The idea is to binary search for the smallest radius ℓ for
which an ε-PLEB data structure with radius ℓ and centers P returns a point
p with d(q, p) ≤ (1 + ε)ℓ; then return this point as the approximate nearest
neighbor.
This requires O(log_{1+ε} R) instances of the ε-PLEB data structure and
O(log log_{1+ε} R) queries. The blowup as a function of R can be avoided using
a more sophisticated data structure called a ring-cover tree, defined in the
paper. We won’t talk about ring-cover trees because they are (a) complicated
and (b) not randomized. Instead, we’ll move directly to the question of how
we solve ε-PLEB.
These are useful if p₁ > p₂ and r₁ < r₂; that is, we are more likely to hash
inputs together if they are closer. Ideally, we can choose r₁ and r₂ to build
ε-PLEB data structures for a range of radii sufficient to do binary search as
described above (or build a ring-cover tree if we are doing it right). For the
moment, we will aim for an (r₁, r₂)-PLEB data structure, which returns a
point within r₁ with high probability if one exists, and never returns a point
farther away than r₂.
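For intuition, here is a toy Python sketch of the bucketing scheme for
Hamming distance, using the bit-sampling family h(x) = x_i, for which
p₁ = 1 − r₁/d and p₂ = 1 − r₂/d. The parameters k and ℓ would be set to
log_{1/p₂} n and n^ρ as in the analysis that follows; the probe cap is what
keeps queries sublinear. All names here are illustrative.

    import random

    def make_bit_sampler(dim, k, rng):
        # g(x) = k independently sampled coordinates of x.
        coords = [rng.randrange(dim) for _ in range(k)]
        return lambda x: tuple(x[c] for c in coords)

    def build_pleb(points, dim, k, ell, seed=0):
        """Hash each point (a tuple of bits) into ell independent tables."""
        rng = random.Random(seed)
        gs = [make_bit_sampler(dim, k, rng) for _ in range(ell)]
        tables = [{} for _ in range(ell)]
        for p in points:
            for g, table in zip(gs, tables):
                table.setdefault(g(p), []).append(p)
        return gs, tables

    def pleb_query(q, gs, tables, r2, max_probes=100):
        """Return some point within Hamming distance r2 of q, if one shows
        up among the (capped number of) candidates in q's buckets."""
        probes = 0
        for g, table in zip(gs, tables):
            for p in table.get(g(q), []):
                if sum(a != b for a, b in zip(p, q)) <= r2:
                    return p
                probes += 1
                if probes >= max_probes:
                    return None
        return None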
There is some similarity between locality-sensitive hashing and a more gen-
eral dimension-reduction technique known as the Johnson-Lindenstrauss
lemma [JL84]; this says that projecting n points in a high-dimensional space
to O(ε⁻² log n) dimensions using an appropriate random matrix preserves
ℓ₂ distances between the points to within relative error ε (in fact, even a
random matrix with entries in {−1, 0, +1} is enough [Ach03]). Unfortunately,
dimension reduction by itself is not enough to solve approximate nearest
neighbors in sublinear time, because we may still need to search a number of
boxes exponential in O(ε⁻² log n), which will be polynomial in n. But we’ll
look at the Johnson-Lindenstrauss lemma and its many other applications
more closely in Chapter 8.
g_j(p) for each j; these buckets are themselves stored in a hash table (by
hashing the value of g_j(p) down further) so that they fit in O(n) space.
Suppose now that d(p, q) ≤ r₁ for some p. Then for each j, the probability
that q lands in the same bucket as p is at least

    Pr[g_j(p) = g_j(q)] ≥ p₁^k
                        = n^{−ρ}
                        = 1/ℓ.
These are not particularly clever hash functions, so the heavy lifting will
be done by the (r₁, r₂)-PLEB construction. Our goal is to build an ε-PLEB
for any fixed r, which will correspond to an (r, r(1 + ε))-PLEB. The main
thing we need to do, following [IM98] as always, is compute a reasonable
bound on ρ = (log p₁)/(log p₂) = ln(1 − r/d)/ln(1 − (1 + ε)r/d). This is
essentially just a matter of hitting it with enough inequalities, although
there are a couple of tricks in the middle.
Compute

    ρ = ln(1 − r/d) / ln(1 − (1 + ε)r/d)
      = (d/r) ln(1 − r/d) / ((d/r) ln(1 − (1 + ε)r/d))
      = ln((1 − r/d)^{d/r}) / ln((1 − (1 + ε)r/d)^{d/r})
      ≤ ln(e⁻¹(1 − r/d)) / ln(e^{−(1+ε)})
      = (−1 + ln(1 − r/d)) / (−(1 + ε))
      = 1/(1 + ε) − ln(1 − r/d)/(1 + ε).    (7.8.1)
Note that we used the fact that 1 + x ≤ eˣ for all x in the denominator
and (1 − x)^{1/x} ≥ e⁻¹(1 − x) for x ∈ [0, 1] in the numerator. The first fact is
our usual favorite inequality.
The second can be proved in a number of ways. The most visually
intuitive is that (1 − x)^{1/x} and e⁻¹(1 − x) are equal at x = 1 and equal
in the limit as x goes to 0, while (1 − x)^{1/x} is concave in between 0 and
1 and e⁻¹(1 − x) is linear. Unfortunately it is rather painful to show that
(1 − x)^{1/x} is in fact concave. An alternative is to rewrite the inequality
(1 − x)^{1/x} ≥ e⁻¹(1 − x) as (1 − x)^{1/x−1} ≥ e⁻¹, apply a change of variables
y = 1/x to get (1 − 1/y)^{y−1} ≥ e⁻¹ for y ∈ [1, ∞), and then argue that (a)
equality holds in the limit as y goes to infinity, and (b) the left-hand side is
decreasing in y, so it never drops below its limiting value.
We now return to (7.8.1). We’d really like the second term to be small
enough that we can just write n^ρ as n^{1/(1+ε)}. (Note that even though it looks
negative, it isn’t, because ln(1 − r/d) is negative.) So we pull a rabbit out of
a hat by assuming that r/d < 1/ln n.⁸ This assumption can be justified by
modifying the algorithm so that d is padded out with up to d ln n unused
junk bits if necessary. Using this assumption, we get −ln(1 − r/d) = O(1/ln n),
so

    n^ρ ≤ n^{1/(1+ε)} · n^{O(1/ln n)} = O(n^{1/(1+ε)}),

since n^{1/ln n} = e.
Plugging into the formula for (r₁, r₂)-PLEB gives O(n^{1/(1+ε)} log n log(1/δ))
hash function evaluations per query, each of which costs O(1) time, plus
O(n^{1/(1+ε)} log(1/δ)) distance computations, which will take O(d) time each.
If we add in the cost of the binary search, we have to multiply this by
O(log log_{1+ε} R log log log_{1+ε} R), where the log-log-log comes from having to
adjust δ so that the error doesn’t accumulate too much over all O(log log R)
steps. The end result is that we can do approximate nearest-neighbor queries
in

    O(n^{1/(1+ε)} log(1/δ)(log n + d) log log_{1+ε} R log log log_{1+ε} R)

time. For reasonably large ε, this is much better than naively testing against
all points in our database, which takes O(nd) time (although it does produce
an exact result).
⁸Indyk and Motwani pull this rabbit out of a hat a few steps earlier, but it’s pretty
much the same rabbit either way.
Chapter 8

Dimension reduction
and then using the union bound to show that this same property holds for
all n² vectors u − v with nonzero probability. This shows the existence of a
good matrix, and we can generate matrices and test them until we find one
that actually works.
¹Radial symmetry is immediate from the density (1/√(2π)) e^{−x²/2} of the univariate
normal distribution. If we consider a vector ⟨X₁, ..., X_d⟩ of independent N(0, 1) random
variables, then the joint density is given by the product ∏_{i=1}^d (1/√(2π)) e^{−x_i²/2} =
(2π)^{−d/2} e^{−∑ x_i²/2}. But ∑ x_i² = r², where r is the distance from the origin, meaning
this distribution has the same density at all points at the same distance.
    ‖Z‖² ≤ β(k/d),

which expands to

    (∑_{i=1}^k X_i²) / (∑_{i=1}^d X_i²) ≤ β(k/d).
Having a ratio between two sums is a nuisance, but we can multiply out
the denominators to turn it into something we can apply a Chernoff-style
argument to.
    Pr[(∑_{i=1}^k X_i²) / (∑_{i=1}^d X_i²) ≤ β(k/d)]
        = Pr[d ∑_{i=1}^k X_i² ≤ βk ∑_{i=1}^d X_i²]
        = Pr[βk(X₁² + ··· + X_d²) − d(X₁² + ··· + X_k²) ≥ 0]
        = Pr[exp(t(βk(X₁² + ··· + X_d²) − d(X₁² + ··· + X_k²))) ≥ 1]
        ≤ E[exp(t(βk(X₁² + ··· + X_d²) − d(X₁² + ··· + X_k²)))]
        = (E[exp(tβkX²)])^{d−k} · (E[exp(t(βk − d)X²)])^k,
where t can be any value greater than 0, the shift from probability to
expectation uses Markov’s inequality, and in the last step we replace each
independent occurrence of Xi with a standard normal random variable X.
Now we just need to be able to compute the moment generating function
E[e^{sX²}] of X². The quick way to do this is to notice that X² has a
chi-squared distribution with one degree of freedom (since the chi-squared
distribution with k degrees of freedom is just the distribution of the sum
of squares of k independent normal random variables), and look up its
m.g.f. (1 − 2s)^{−1/2} (for s < 1/2).
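A quick numerical sanity check of this m.g.f., assuming numpy; the sample
size and values of s are arbitrary (s is kept below 1/4 so that the empirical
average has finite variance):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(1_000_000)
    for s in (0.05, 0.1, 0.2):      # need s < 1/2 for the m.g.f. to exist
        empirical = np.exp(s * x**2).mean()
        exact = (1 - 2 * s) ** -0.5
        print(f"s={s}: empirical {empirical:.4f}, exact {exact:.4f}")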
The same argument applied to g(−t) gives essentially the same bound
for β = 1 + ε > 1:

    Pr[‖Z‖² ≥ β(k/d)] ≤ e^{(k/2)(1−β+ln β)}.
Lemma 8.1.2 ([JL84]). For every d, 0 < ε < 1, and 0 < δ < 1, there exists
a distribution over linear functions f : ℝ^d → ℝ^k with k = O(ε⁻² log(1/δ))
such that for every x ∈ ℝ^d,

    Pr[(1 − ε)‖x‖² ≤ ‖f(x)‖² ≤ (1 + ε)‖x‖²] ≥ 1 − δ.
This can be handy for applications where we don’t know the vectors we
will be working with in advance.
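A sketch of how such an f might be sampled in practice, assuming numpy;
the constant 8 in the choice of k is an arbitrary adequate value, not the best
known, and the scaling by 1/√k folds the normalization into the matrix.

    import numpy as np

    def jl_transform(d, eps, delta, rng=None):
        # Sample f(x) = Ax with A having i.i.d. N(0, 1/k) entries,
        # k = O(eps^-2 log(1/delta)) as in Lemma 8.1.2.
        rng = rng or np.random.default_rng()
        k = int(np.ceil(8 * np.log(1 / delta) / eps**2))
        A = rng.standard_normal((k, d)) / np.sqrt(k)
        return lambda x: A @ x

    f = jl_transform(d=10_000, eps=0.1, delta=1e-3)
    x = np.random.default_rng(2).standard_normal(10_000)
    # Squared-norm ratio should land in [1 - eps, 1 + eps] w.h.p.
    print(np.linalg.norm(f(x))**2 / np.linalg.norm(x)**2)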
8.2 Applications
The intuition is that the Johnson-Lindenstrauss lemma lets us reduce the
dimension of some problem involving distances between n points, where we
are willing to tolerate a small constant relative error, from some arbitrary d
to O(log n) (or O(log(1/δ)) if we just care about the error probability per
pair of points). So applications tend to fall into one of two categories:

• Time savings: operations that cost time proportional to the dimension,
  like distance computations, now take O(log n) time instead of O(d).

• Space savings: we need only store the projected points, at O(log n)
  numbers each instead of d.
If we are lucky, we’ll get both payoffs, winning on both time and space.
For example, suppose we have a set of n points x₁, x₂, ..., x_n representing
the centers of various clusters in a d-dimensional space, and we want to
rapidly classify incoming points y into one of these clusters by finding which
x_i is closest to y. If we do this naively, this is a Θ(nd) operation, since it takes
Θ(d) time to compute each distance between y and some x_i. If instead we
are willing to accept the inaccuracy associated with Johnson-Lindenstrauss,
we can fix a matrix A in advance, replace each x_i with Ax_i, and find the
x_i that is (approximately) closest to y using O(d log n) time to reduce y to
Ay and O(n log n) time to compute the distance between Ay and each Ax_i.
In this case we are reducing both the time complexity of our classification
algorithm (at least if we don’t count the pre-processing time to generate the
Axi ) and the amount of data we need to store.
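A numpy sketch of this classification scheme; the sizes are arbitrary and k is
fixed by hand rather than derived from a target ε.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, k = 1000, 5000, 64             # k ~ O(log n), up to the eps^-2 factor
    centers = rng.standard_normal((n, d))
    A = rng.standard_normal((k, d)) / np.sqrt(k)
    proj_centers = centers @ A.T         # precompute each Ax_i once

    def classify(y):
        # O(dk) to project y, then O(nk) for distances, vs O(nd) naively.
        z = A @ y
        return np.argmin(((proj_centers - z) ** 2).sum(axis=1))

    # A slightly perturbed center should (with high probability) map to itself.
    print(classify(centers[42] + 0.01 * rng.standard_normal(d)))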
An example of space savings is the use of the Johnson-Lindenstrauss
transform in streaming algorithms (see §7.7). Freksen [Fre21] gives a simple
example of estimating ‖x‖₂ where x is a vector of counts of items from a set
of size n presented one at a time. If we don’t charge for the space to store
the JLT function f, we can simply add the i-th column of the matrix to our
running total whenever we see item i, and we need only store O(ε⁻² log(1/δ))
distinct numerical values of an appropriate precision to estimate ‖x‖₂ to
within relative error ε with probability at least 1 − δ. The problem is that
in reality we do need to represent f somehow, and even for a ±1 matrix
this will take Θ(nε⁻² log(1/δ)) space. Fortunately it can be shown that
generating f using a 4-independent hash function reduces the space for f to
O(log n), giving the Tug-of-War sketch of Alon et al. [AMS96], one of the
first compact streaming data structures. Though this is a nice application of
the JLT, it’s worth mentioning that Cormode and Muthukrishnan [CM05]
observe that this is still significantly more costly for most queries than their
own count-min sketch.
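Here is a simplified tug-of-war estimator in numpy. It uses an explicit matrix
of fully random ±1 signs, which is exactly the Θ(nε⁻² log(1/δ))-space
representation that the 4-independent hashing trick avoids, so it illustrates
the estimator but not the space savings; sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 10_000, 200                         # universe size, number of estimators
    signs = rng.choice([-1, 1], size=(m, n))   # stand-in for 4-wise hashing
    z = np.zeros(m)

    def process(i, c=1):
        z[:] += c * signs[:, i]     # one counter update per estimator

    stream = rng.integers(0, n, size=5000)
    for i in stream:
        process(i)

    # Each z_j^2 is an unbiased estimate of ||x||_2^2; averaging reduces variance.
    counts = np.bincount(stream, minlength=n).astype(float)
    print(np.mean(z**2), (counts**2).sum())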
Chapter 9

Martingales and stopping times

9.1 Definitions
The general form of a martingale {X_t, F_t} consists of a sequence of random
variables X_t, each measurable with respect to the corresponding member of
a filtration {F_t} (an increasing sequence of σ-algebras), satisfying

    E[X_{t+1} | F_t] = X_t.    (9.1.1)
A submartingale satisfies

    X_t ≤ E[X_{t+1} | F_t],    (9.2.1)

while a supermartingale satisfies

    X_t ≥ E[X_{t+1} | F_t].    (9.2.2)

In each case, what is “sub” or “super” is the value at the current time
compared to the expected value at the next time. Intuitively, a submartingale
corresponds to a process where you win on average, while a supermartingale
is a process where you lose on average. Casino games (in profitable casinos)
are submartingales for the house and supermartingales for the player.
Sub- and supermartingales can be reduced to martingales by subtracting
off the expected change at each step. For example, if {X_t} is a submartingale
with respect to {F_t}, then the process {Y_t} defined recursively by

    Y₀ = X₀
    Y_{t+1} = Y_t + X_{t+1} − E[X_{t+1} | F_t]
¹Different authors impose different conditions on the range of τ; for example, Mitzen-
macher and Upfal [MU17] exclude the case τ = ∞. We allow τ = ∞ to represent the
outcome where we never stop. This can be handy for modeling processes where this
outcome is possible, although in practice we will typically insist that it occurs only with
probability zero.
is a martingale, since E[Y_{t+1} | F_t] = Y_t + E[X_{t+1} | F_t] − E[X_{t+1} | F_t] = Y_t.

The Optional Stopping Theorem gives E[X_τ] = E[X₀] under any of several
alternative sets of conditions; the most general set, condition (4), requires
that

(4a) Pr[τ < ∞] = 1,

(4b) E[|X_τ|] < ∞, and

(4c) lim_{t→∞} E[X_t · 1_{[τ>t]}] = 0.
Lemma 9.3.2. Let (X_t, F_t) be a martingale and τ a stopping time for {F_t}.
Then for any n ∈ ℕ, E[X_{min(τ,n)}] = E[X₀].

Proof. Define Y_t = X₀ + ∑_{i=1}^t (X_i − X_{i−1})·1_{[τ>i−1]}; the sum
telescopes up to time min(τ, t), so Y_n = X_{min(τ,n)}. Then (Y_t, F_t) is a martin-
gale, because we can calculate E[Y_{t+1} | F_t] = E[Y_t + (X_{t+1} − X_t)·1_{[τ>t]} | F_t] =
Y_t + 1_{[τ>t]} · E[X_{t+1} − X_t | F_t] = Y_t; effectively, we are treating 1_{[τ>t]} as a
sequence of bets, and we know that adjusting our bets doesn’t change the
martingale property. But then E[X_{min(τ,n)}] = E[Y_n] = E[Y₀] = E[X₀].
It remains to conclude that E[X₀] = E[X_τ], since they are both limits of the
same sequence. To do this, write

    X_τ = X_{min(τ,n)} + 1_{[τ>n]}(X_τ − X_n).

This holds because either τ ≤ n, and we just get X_τ, or τ > n, and we get
X_n + (X_τ − X_n) = X_τ.
Taking the expectation of both sides gives

    E[X_τ] = E[X_{min(τ,n)}] + E[1_{[τ>n]}(X_τ − X_n)]
           = E[X₀] + E[1_{[τ>n]}(X_τ − X_n)].

So if we can show that the right-hand term goes to zero in the limit, we are
done.
For the bounded-range case, we have |X_τ − X_n| ≤ 2M, so E[1_{[τ>n]}(X_τ − X_n)] ≤
2M · Pr[τ > n]. Since in this case we assume Pr[τ < ∞] = 1, lim_{n→∞} Pr[τ > n] =
0, and the theorem holds.
For bounded increments, we have

    E[(X_τ − X_n)·1_{[τ>n]}] = E[∑_{t≥n} (X_{t+1} − X_t)·1_{[τ>t]}]
                            ≤ ∑_{t≥n} E[|X_{t+1} − X_t| · 1_{[τ>t]}]
                            ≤ ∑_{t≥n} E[c · 1_{[τ>t]}]
                            = c ∑_{t≥n} Pr[τ > t].

But E[τ] = ∑_{t=0}^∞ Pr[τ > t]. Under the assumption that this series converges,
its tail goes to zero, and again the theorem holds.
For the general case, we can expand

    E[X_τ] = E[X_{min(τ,n)}] + E[1_{[τ>n]} X_τ] − E[1_{[τ>n]} X_n],

which implies

    lim_{n→∞} E[X_τ] = lim_{n→∞} E[X_{min(τ,n)}] + lim_{n→∞} E[1_{[τ>n]} X_τ]
                       − lim_{n→∞} E[1_{[τ>n]} X_n],
assuming all these limits exist and are finite. We’ve already established that
the first limit is E[X₀], which is exactly what we want. So we just need to
show that the other two limits both converge to zero. For the last limit, we
just use condition (4c), which gives lim_{n→∞} E[1_{[τ>n]} X_n] = 0; no further
argument is needed. But we still need to show that the middle limit also
vanishes.
Here we use condition (4b). Observe that E[1_{[τ>n]} X_τ] = ∑_{t=n+1}^∞ E[1_{[τ=t]} X_t].
Compare this with E[X_τ] = ∑_{t=0}^∞ E[1_{[τ=t]} X_t]; this is an absolutely conver-
gent series (this is why we need condition (4b)), so in the limit the sum of
the terms for t = 0 ... n converges to E[X_τ]. But this means that the sum
of the remaining terms for t = n + 1 ... ∞ converges to zero. So the middle
term goes to zero as n goes to infinity. This completes the proof.
9.4 Applications
Here we give some examples of the Optional Stopping Theorem in action. In
each case, the trick is to find an appropriate martingale and stopping time,
and let the theorem do all the work.
For the ±1 random walk case, we have V_t = 1 always, giving ∑_{i=1}^t V_i = t and
E[X_τ²] = E[X₀²] + E[τ] when τ is a stopping time satisfying the conditions of the Optional
Stopping Theorem. For the general case, the same argument gives E[X_τ²] = E[X₀²] +
E[∑_{t=1}^τ V_t] instead: the expected square position of X_t is incremented by the conditional
variance at each step.
⁴This would be a random walk with one absorbing barrier.
⁵In fact, we always reach b. An easy way to see this is to imagine a sequence of intervals
of length n₁, n₂, ..., where n_{i+1} = (b + ∑_{j=1}^i n_j)². At the end of the i-th interval, we
are no lower than −∑_{j=0}^i n_j, so we only need to go up √n_{i+1} positions to reach b by the
This is the same formula as in §3.4.3.1, but we’ve eliminated the bound
on N and allowed for much more dependence between N and the X_i.⁷
Lemma 9.4.1. Let {X_i} be a martingale with X_i ≥ 0. Then for any fixed n,

    Pr[max_{i≤n} X_i ≥ α] ≤ E[X₀]/α.    (9.4.2)
Proof. The idea is to pick a stopping time τ such that max_{i≤n} X_i ≥ α if and
only if X_τ ≥ α.
Let τ be the first time such that X_τ ≥ α or τ ≥ n. Then τ is a stopping
time for {X_i}, since we can determine from X₀, ..., X_t whether τ ≤ t or
not. We also have that τ ≤ n always, which is equivalent to τ = min(τ, n).
Finally, X_τ ≥ α means that max_{i≤n} X_i ≥ X_τ ≥ α, and conversely if there
is some t ≤ n with X_t = max_{i≤n} X_i ≥ α, then τ is the first such t, giving
X_τ ≥ α.
Lemma 9.3.2 says E[X_τ] = E[X₀]. So Markov’s inequality gives
Pr[max_{i≤n} X_i ≥ α] = Pr[X_τ ≥ α] ≤ E[X_τ]/α = E[X₀]/α, as claimed.
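A quick simulation of (9.4.2), using an arbitrary non-negative martingale
(start at X₀ = 1 and multiply by 1/2 or 3/2 with equal probability, so that
E[X_{t+1} | F_t] = X_t):

    import random

    def max_of_martingale(n, rng):
        # Run the multiplicative martingale for n steps, tracking the max.
        x, best = 1.0, 1.0
        for _ in range(n):
            x *= rng.choice((0.5, 1.5))
            best = max(best, x)
        return best

    rng = random.Random(5)
    alpha, trials = 4.0, 100_000
    hits = sum(max_of_martingale(50, rng) >= alpha for _ in range(trials))
    print(hits / trials, "<=", 1 / alpha)   # bound E[X_0]/alpha = 1/4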
For non-negative submartingales, a similar bound holds with E[X_n] in place
of E[X₀]:

    Pr[max_{i≤n} X_i ≥ α] ≤ E[X_n]/α.    (9.4.3)

The proof is similar, but requires showing first that E[X_τ] ≤ E[X_n] when
τ ≤ n is a stopping time and {X_i} is a submartingale.
Doob’s martingale inequality is what you get if you generalize Markov’s
inequality to martingales. The analogous generalization of Chebyshev’s
inequality is Kolmogorov’s inequality, which says:

Lemma 9.4.2. For sums S_i = ∑_{j=1}^i X_j of independent random variables
X₁, X₂, ..., X_n with E[X_i] = 0,

    Pr[max_{i≤n} |S_i| ≥ α] ≤ Var[S_n]/α².    (9.4.4)
Proof. Let Y_i = S_i² − Var[S_i]. Then {Y_i} is a martingale. This implies that
E[Y_n] = E[S_n²] − Var[S_n] = Y₀ = 0 and thus that E[S_n²] = Var[S_n]. It’s easy
to see that S_i² is a submartingale, since S_i² = Y_i + Var[S_i] is the sum of a
martingale and a non-decreasing deterministic sequence. Now apply (9.4.3)
to {S_i²} with threshold α², using max_{i≤n} |S_i| ≥ α if and only if
max_{i≤n} S_i² ≥ α².
Let χ_i = 1 if x₁ ... x_i = x_{k−i+1} ... x_k, and 0 otherwise. Then the number
of losers is given by τ − ∑_{i=1}^k χ_i and the total expected payoff is

    E[X_τ] = E[−(τ − ∑_{i=1}^k χ_i) + ∑_{i=1}^k χ_i(2^i − 1)]
           = E[−τ + ∑_{i=1}^k χ_i 2^i]
           = 0.
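The resulting formula E[τ] = ∑_i χ_i 2^i is easy to check by simulation for
coin-flip patterns; the patterns below are arbitrary examples.

    import random

    def expected_wait(pattern):
        # E[tau] = sum of 2^i over lengths i where prefix equals suffix.
        k = len(pattern)
        return sum(2**i for i in range(1, k + 1) if pattern[:i] == pattern[-i:])

    def simulate(pattern, rng):
        window, t = "", 0
        while True:
            window = (window + rng.choice("HT"))[-len(pattern):]
            t += 1
            if window == pattern:
                return t

    rng = random.Random(6)
    for pat in ("HH", "HT", "HTH"):
        avg = sum(simulate(pat, rng) for _ in range(20_000)) / 20_000
        print(pat, expected_wait(pat), round(avg, 2))   # e.g. HH: 6, HT: 4, HTH: 10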
Chapter 10

Markov chains
If you want to learn more about Markov chains than presented here, they
are usually covered in general probability textbooks (for example, in [Fel68]
or [GS01]), mentioned in many linear algebra textbooks [Str03], covered
in some detail in stochastic processes textbooks [KT75], and covered in
exquisite detail in many books dedicated specifically to the subject [KS76,
KSK76]. Good sources for mixing times for Markov chains are the textbook
of Levin, Peres, and Wilmer [LPW09] and the survey paper by Montenegro
and Tetali [MT05]. An early reference on the mixing times for random
walks on graphs that helped inspire much subsequent work is the Aldous-Fill
manuscript [AF01], which can be found on-line at
http://www.stat.berkeley.edu/~aldous/RWG/book.html.
In both cases we want to have a bound on how long it takes the Markov
chain to converge, either because it tells us when our algorithm terminates,
or because it tells us how long to mix it up before looking at the current
state.
10.1.1 Examples
• A fair ±1 random walk. The state space is Z, the transition probabilities
are pij = 1/2 if |i − j| = 1, 0 otherwise. This is an example of a Markov
chain that is also a martingale.
• A fair ±1 random walk on a cycle. As above, but now the state space
is Z/m, the integers mod m. This is a finite Markov chain. It is also in
some sense a martingale, although we usually don’t define martingales
over finite groups.
We can view any Markov chain as a random walk on a graph in this way:
the states become vertices, and the transitions become edges, each labeled
with its transition probability. It’s conventional in this representation to
exclude edges with probability 0 and include self-loops for any transitions
i → i. If the resulting graph is small enough or has a nice structure, this
can be a convenient way to draw a Markov chain.
where the supremum is taken over all sets A for which Pr[X ∈ A] and
Pr[Y ∈ A] are both defined.⁶
An equivalent definition is

    d_TV(X, Y) = sup_A |Pr[X ∈ A] − Pr[Y ∈ A]|.

So d_TV(x, y) = ½‖x − y‖₁.
Proof. Compute

    |E_x(Z) − E_y(Z)| = |∑_z z (Pr_x(Z = z) − Pr_y(Z = z))|
                      ≤ ∑_z |z| · |Pr_x(Z = z) − Pr_y(Z = z)|
                      ≤ M ∑_z |Pr_x(Z = z) − Pr_y(Z = z)|
                      ≤ M ‖x − y‖₁
                      = 2M · d_TV(x, y).
means that for any ε > 0, there is a mixing time t_mix(ε) such that for any
initial distribution x and any t ≥ t_mix(ε),

    d_TV(xP^t, π) ≤ ε.
    d_TV(X, Y) ≤ Pr[X ≠ Y].⁷
⁷It turns out that the bound in the Coupling Lemma is tight in the following sense:
for any given distributions on X and Y, there exists a joint distribution giving these
distributions such that d_TV(X, Y) is exactly equal to Pr[X ≠ Y] when X and Y are
sampled from the joint distribution. For discrete distributions, the easiest way to construct
the joint distribution is first to let Y = X = i for each i with probability
min(Pr[X = i], Pr[Y = i]), and then distribute the remaining probability for
X over all the cases where Pr[X = i] > Pr[Y = i] and similarly for Y over all the
cases where Pr[Y = i] > Pr[X = i]. Looking at the unmatched values for X gives
Pr[X ≠ Y] ≤ ∑_{i: Pr[X=i]>Pr[Y=i]} (Pr[X = i] − Pr[Y = i]) ≤ d_TV(X, Y). So in this case
Pr[X ≠ Y] = d_TV(X, Y).
Unfortunately, the fact that there always exists a perfect coupling in this sense does not
mean that we can express it in any convenient way, or that even if we could, it would arise
from the kind of causal, step-by-step construction that we will use for couplings between
Markov processes.
    Pr[X ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∈ A ∧ Y ∉ A],
    Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∉ A ∧ Y ∈ A],

and thus

    Pr[X ∈ A] − Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∉ A] − Pr[X ∉ A ∧ Y ∈ A]
                          ≤ Pr[X ∈ A ∧ Y ∉ A]
                          ≤ Pr[X ≠ Y].

Since this holds for any particular set A, it also holds when we take the
supremum over all A to get d_TV(X, Y).
For Markov chains, our goal will be to find a useful coupling between a se-
quence of random variables X₀, X₁, X₂, ... corresponding to the Markov chain
starting in an arbitrary distribution with a second sequence Y₀, Y₁, Y₂, ...
corresponding to the same chain starting in a stationary distribution. What
will make a coupling useful is if Pr[X_t ≠ Y_t] is small for reasonably large t:
since Y_t has the stationary distribution, this will show that d_TV(xP^t, π) is
also small.
Our first use of this technique will be to show, using a rather generic
coupling, that Markov chains with certain nice properties converge to their
stationary distribution in the limit. Later we will construct specialized
couplings for particular Markov chains to show that they converge quickly.
But first we will consider what properties a Markov chain must have to
converge at all.
If the period of i is m, then starting from i we can only return to i at times
that are multiples of m. If m = 1, state i is said to be aperiodic. A Markov
chain as a whole is aperiodic if all of its states are aperiodic. In graph-theoretic
terms, this means that the graph of the chain is not k-partite for any k > 1.
Reversible chains are also an interesting special case: if a chain is reversible,
it can’t have a period greater than 2, since we can always step off a node
and step back.
If our Markov chain is not aperiodic, we can make it aperiodic by flipping
a coin at each step to decide whether to move or not. This gives a lazy
Markov chain whose transition probabilities are given by ½p_ij when i ≠ j
and ½ + ½p_ii when i = j. This doesn’t affect the stationary distribution: if
we replace our transition matrix P with a new transition matrix (P + I)/2,
and πP = π, then π(P + I)/2 = ½πP + ½πI = ½π + ½π = π.
Unfortunately there is no quick fix for reducible Markov chains. But
since we will often be designing the Markov chains we will be working with,
we can just take care to make sure they are not reducible.
We will later need the following lemma about aperiodic Markov chains,
which is related to the Frobenius problem of finding the minimum value
that cannot be constructed using coins of given denominations:

Lemma 10.2.4. If i is an aperiodic state of some Markov chain, then there
is a time s such that p_ii(t) ≠ 0 for all t ≥ s.

Proof. Let S = {t | p_ii(t) ≠ 0}. Since gcd(S) = 1, there is a finite subset S′
of S such that gcd S′ = 1. Write the elements of S′ as m₁, m₂, ..., m_k and let
M = ∏_{j=1}^k m_j. From the extended Euclidean algorithm, there exist integer
coefficients a₁, ..., a_k with ∑_{j=1}^k a_j m_j = 1, and we would like
to use each a_j as the number of times to go around the length-m_j loop from
i to i. Unfortunately many of these a_j will be negative.
To solve this problem, we replace a_j with b_j = a_j + M/m_j. This makes all
the coefficients non-negative, and gives ∑_{j=1}^k b_j m_j = kM + 1. This implies
that both kM (go around each length-m_j loop M/m_j times) and kM + 1 are
possible return times; since kM and kM + 1 are relatively prime, every
sufficiently large t is a non-negative integer combination of them, giving
p_ii(t) ≠ 0 for all sufficiently large t.
Proof. Consider two copies of the chain {X_t} and {Y_t}, where X₀ starts in
some arbitrary distribution x and Y₀ starts in a stationary distribution π.
Define a coupling between {X_t} and {Y_t} by the rule: (a) if X_t ≠ Y_t, then
Pr[X_{t+1} = j ∧ Y_{t+1} = j′ | X_t = i ∧ Y_t = i′] = p_ij · p_{i′j′}; and (b) if X_t = Y_t,
then Pr[X_{t+1} = Y_{t+1} = j | X_t = Y_t = i] = p_ij. Intuitively, we let both chains
run independently until they collide, after which we run them together. Since
each chain individually moves from state i to state j with probability p_ij
in either case, we have that X_t evolves normally and Y_t remains in the
stationary distribution.
Now let us show that d_TV(xP^t, π) ≤ Pr[X_t ≠ Y_t] goes to zero in the
limit. Pick some state i. Let r be the maximum over all states j of the first
passage time f_ji, where f_ji is the minimum time t such that p^t_ji ≠ 0. Let s
be a time such that p^t_ii ≠ 0 for all t ≥ s (the existence of such an s is given
by Lemma 10.2.4).
Suppose that at time ℓ(r + s), where ℓ ∈ ℕ, X_{ℓ(r+s)} = j ≠ j′ = Y_{ℓ(r+s)}.
Then there are times ℓ(r + s) + u and ℓ(r + s) + u′, where u, u′ ≤ r, such that
X reaches i at time ℓ(r + s) + u and Y reaches i at time ℓ(r + s) + u′ with
nonzero probability. Since the remaining time r + s − u is at least s, having
reached i at these times, X and Y both return to i at time ℓ(r + s) + (r + s) =
(ℓ + 1)(r + s) with nonzero probability. Let ε > 0 be the product of these
nonzero probabilities; then

    Pr[X_{(ℓ+1)(r+s)} ≠ Y_{(ℓ+1)(r+s)}] ≤ (1 − ε) Pr[X_{ℓ(r+s)} ≠ Y_{ℓ(r+s)}],

and in general we have Pr[X_t ≠ Y_t] ≤ (1 − ε)^{⌊t/(r+s)⌋}, which goes to zero
in the limit. This implies that d_TV(xP^t, π) also goes to zero in the limit
(using the Coupling Lemma), and since any initial distribution (including a
stationary distribution) converges to π, π is the unique stationary distribution
as claimed.
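The coupling in this proof is easy to simulate. Here it is for the lazy ±1
walk on ℤ_m (whose stationary distribution is uniform), with Pr[X_t ≠ Y_t]
estimated empirically as an upper bound on the total variation distance;
parameters are arbitrary.

    import random

    def lazy_step(x, m, rng):
        # Lazy +-1 walk on Z_m: stay put w.p. 1/2, else move to a neighbor.
        if rng.random() < 0.5:
            return x
        return (x + rng.choice((-1, 1))) % m

    def disagreement(m, t, trials, rng):
        bad = 0
        for _ in range(trials):
            x, y = 0, rng.randrange(m)   # fixed start vs stationary (uniform) start
            for _ in range(t):
                if x == y:
                    x = y = lazy_step(x, m, rng)  # rule (b): move together
                else:
                    x = lazy_step(x, m, rng)      # rule (a): move independently
                    y = lazy_step(y, m, rng)
            bad += (x != y)
        return bad / trials   # upper-bounds d_TV(x P^t, pi) by the Coupling Lemma

    rng = random.Random(7)
    for t in (10, 50, 250):
        print(t, disagreement(16, t, 2000, rng))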
    π_i p_ij = π_j p_ji for all i and j.

These are called the detailed balance equations—they say that in the
stationary distribution, the probability of seeing a transition from i to j is
equal to the probability of seeing a transition from j to i. If this is the case,
then ∑_i π_i p_ij = ∑_i π_j p_ji = π_j, which means that π is stationary.
It’s worth noting that this works for countable chains even if they are
not finite, because the sums always converge since each term is non-negative
and ∑_i π_i p_ij is dominated by ∑_i π_i = 1. However, it may not be the case
for any particular p that there exists a corresponding stationary distribution
π. If this happens, the chain is not reversible.
For example, for a random walk on an undirected graph, setting

    π_u = d(u) / ∑_u d(u)
        = d(u) / (2|E|)

gives a reversible stationary distribution, since it satisfies

    π_u p_uv = (d(u)/2|E|) · (1/d(u))
             = 1/(2|E|)
             = (d(v)/2|E|) · (1/d(v))
             = π_v p_vu.
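A small check of these equations for the random walk on an arbitrary little
graph, using exact rational arithmetic:

    from fractions import Fraction

    # Random walk on an undirected graph: verify pi_u p_uv = pi_v p_vu.
    edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    two_E = 2 * len(edges)
    pi = {u: Fraction(len(nbrs), two_E) for u, nbrs in adj.items()}

    for u, v in edges:
        flow_uv = pi[u] * Fraction(1, len(adj[u]))   # pi_u p_uv
        flow_vu = pi[v] * Fraction(1, len(adj[v]))   # pi_v p_vu
        assert flow_uv == flow_vu == Fraction(1, two_E)
    print(dict(pi))   # each pi_u = d(u)/2|E|; the values sum to 1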
If we don’t know π in advance, we can often guess it by observing that

    π_i p_ij = π_j p_ji

implies

    π_j = π_i p_ij / p_ji,    (10.3.2)

provided p_ij ≠ 0. This gives us the ability to calculate π_k starting from any
initial state i as long as there is some chain of transitions i = i₀ → i₁ → i₂ →
... → i_ℓ = k where each step i_m → i_{m+1} has p_{i_m, i_{m+1}} ≠ 0. For a random walk
on a graph, this implies that π is unique as long as the graph is connected.
This of course only works for