
Notes on Randomized Algorithms

James Aspnes

2024-07-27 23:04

Copyright © 2009–2024 by James Aspnes. Distributed under a Creative
Commons Attribution-ShareAlike 4.0 International license:
https://creativecommons.org/licenses/by-sa/4.0/.
Contents

Table of contents ii

List of figures xv

List of tables xvi

List of algorithms xvii

Preface xviii

1 Randomized algorithms 1
1.1 Searching an array . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Verifying polynomial identities . . . . . . . . . . . . . . . . . 4
1.3 Randomized QuickSort . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Brute force method: solve the recurrence . . . . . . . 6
1.3.2 Clever method: use linearity of expectation . . . . . . 7
1.4 Where does the randomness come from? . . . . . . . . . . . . 9
1.5 Classifying randomized algorithms . . . . . . . . . . . . . . . 10
1.5.1 Las Vegas vs Monte Carlo . . . . . . . . . . . . . . . . 10
1.5.2 Randomized complexity classes . . . . . . . . . . . . . 11
1.6 Classifying randomized algorithms by their methods . . . . . 13

2 Probability theory 15
2.1 Probability spaces and events . . . . . . . . . . . . . . . . . . 16
2.1.1 General probability spaces . . . . . . . . . . . . . . . . 16
2.2 Boolean combinations of events . . . . . . . . . . . . . . . . . 18
2.3 Conditional probability . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 Conditional probability and independence . . . . . . . 21
2.3.2 Conditional probability and the law of total probability 21
2.3.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.3.3.1 Racing coin-flips . . . . . . . . . . . . . . . . 23


2.3.3.2 Karger’s min-cut algorithm . . . . . . . . . . 25

3 Random variables 28
3.1 Operations on random variables . . . . . . . . . . . . . . . . . 29
3.2 Random variables and events . . . . . . . . . . . . . . . . . . 29
3.3 Measurability . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Linearity of expectation . . . . . . . . . . . . . . . . . 32
3.4.1.1 Linearity of expectation for infinite sequences 33
3.4.2 Expectation and inequalities . . . . . . . . . . . . . . 34
3.4.3 Expectation of a product . . . . . . . . . . . . . . . . 35
3.4.3.1 Wald’s equation (simple version) . . . . . . . 35
3.5 Conditional expectation . . . . . . . . . . . . . . . . . . . . . 36
3.5.1 Expectation conditioned on an event . . . . . . . . . . 37
3.5.2 Expectation conditioned on a random variable . . . . 38
3.5.2.1 Calculating conditional expectations . . . . . 39
3.5.2.2 The law of iterated expectation . . . . . . . . 41
3.5.2.3 Conditional expectation as orthogonal projection . . . . . 41
3.5.3 Expectation conditioned on a σ-algebra . . . . . . . . 43
3.5.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.1 Yao’s lemma . . . . . . . . . . . . . . . . . . . . . . . 45
3.6.2 Geometric random variables . . . . . . . . . . . . . . . 47
3.6.3 Coupon collector . . . . . . . . . . . . . . . . . . . . . 48
3.6.4 Hoare’s FIND . . . . . . . . . . . . . . . . . . . . . . . 49

4 Basic probabilistic inequalities 51


4.1 Markov’s inequality . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.1.1 Sum of fair coins . . . . . . . . . . . . . . . . 52
4.1.1.2 Randomized QuickSort . . . . . . . . . . . . 52
4.1.1.3 Balls in bins . . . . . . . . . . . . . . . . . . 52
4.2 Union bound (Boole’s inequality) . . . . . . . . . . . . . . . . 53
4.2.1 Example: Balls in bins . . . . . . . . . . . . . . . . . . 53
4.3 Jensen’s inequality . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.2.1 Fair coins: lower bound . . . . . . . . . . . . 56

4.3.2.2 Fair coins: upper bound . . . . . . . . . . . . 56


4.3.2.3 Sifters . . . . . . . . . . . . . . . . . . . . . . 57

5 Concentration bounds 59
5.1 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 Computing variance . . . . . . . . . . . . . . . . . . . 60
5.1.1.1 Alternative formula . . . . . . . . . . . . . . 60
5.1.1.2 Variance of a Bernoulli random variable . . . 61
5.1.1.3 Variance of a sum . . . . . . . . . . . . . . . 61
5.1.1.4 Variance of a geometric random variable . . 63
5.1.2 More examples . . . . . . . . . . . . . . . . . . . . . . 66
5.1.2.1 Flipping coins . . . . . . . . . . . . . . . . . 66
5.1.2.2 Balls in bins . . . . . . . . . . . . . . . . . . 66
5.1.2.3 Lazy select . . . . . . . . . . . . . . . . . . . 67
5.2 Chernoff bounds . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2.1 The classic Chernoff bound . . . . . . . . . . . . . . . 69
5.2.2 Easier variants . . . . . . . . . . . . . . . . . . . . . . 71
5.2.3 Lower bound version . . . . . . . . . . . . . . . . . . . 72
5.2.4 Two-sided version . . . . . . . . . . . . . . . . . . . . 72
5.2.5 What if we only have a bound on E [S]? . . . . . . . . 73
5.2.6 Almost-independent variables . . . . . . . . . . . . . . 74
5.2.7 Other tail bounds for the binomial distribution . . . . 75
5.2.8 Applications . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.8.1 Flipping coins . . . . . . . . . . . . . . . . . 75
5.2.8.2 Balls in bins again . . . . . . . . . . . . . . . 76
5.2.8.3 Flipping coins, central behavior . . . . . . . 76
5.2.8.4 Permutation routing on a hypercube . . . . . 77
5.3 The Azuma-Hoeffding inequality . . . . . . . . . . . . . . . . 80
5.3.1 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . 80
5.3.1.1 Hoeffding vs Chernoff . . . . . . . . . . . . . 82
5.3.1.2 Asymmetric version . . . . . . . . . . . . . . 83
5.3.2 Azuma’s inequality . . . . . . . . . . . . . . . . . . . . 83
5.3.3 The method of bounded differences . . . . . . . . . . . 88
5.3.4 Applications . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.4.1 Sprinkling points on a hypercube . . . . . . . 90
5.3.4.2 Chromatic number of a random graph . . . . 91
5.3.4.3 Balls in bins . . . . . . . . . . . . . . . . . . 92
5.3.4.4 Probabilistic recurrence relations . . . . . . . 92
5.3.4.5 Multi-armed bandits . . . . . . . . . . . . . . 94
The UCB1 algorithm . . . . . . . . . . . . . . . 95

Analysis of UCB1 . . . . . . . . . . . . . . . . . 96
5.4 Relation to limit theorems . . . . . . . . . . . . . . . . . . . . 99
5.5 Anti-concentration bounds . . . . . . . . . . . . . . . . . . . . 100
5.5.1 The Berry-Esseen theorem . . . . . . . . . . . . . . . . 100
5.5.2 The Littlewood-Offord problem . . . . . . . . . . . . . 101

6 Randomized search trees 102


6.1 Binary search trees . . . . . . . . . . . . . . . . . . . . . . . . 102
6.1.1 Rebalancing and rotations . . . . . . . . . . . . . . . . 103
6.2 Random insertions . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3 Treaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.1 Assumption of an oblivious adversary . . . . . . . . . 107
6.3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2.1 Searches . . . . . . . . . . . . . . . . . . . . 109
6.3.2.2 Insertions and deletions . . . . . . . . . . . . 109
6.3.2.3 Other operations . . . . . . . . . . . . . . . . 111
6.4 Skip lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7 Hashing 114
7.1 Hash tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.2 Universal hash families . . . . . . . . . . . . . . . . . . . . . . 115
7.2.1 Linear congruential hashing . . . . . . . . . . . . . . . 118
7.2.2 Tabulation hashing . . . . . . . . . . . . . . . . . . . . 119
7.3 FKS hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.4 Cuckoo hashing . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4.1 Structure . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.4.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.5 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.6 Bloom filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.6.1 Construction . . . . . . . . . . . . . . . . . . . . . . . 128
7.6.2 False positives . . . . . . . . . . . . . . . . . . . . . . 128
7.6.3 Comparison to optimal space . . . . . . . . . . . . . . 130
7.6.4 Applications . . . . . . . . . . . . . . . . . . . . . . . 132
7.6.5 Counting Bloom filters . . . . . . . . . . . . . . . . . . 132
7.7 Data stream computation . . . . . . . . . . . . . . . . . . . . 133
7.7.1 Cardinality estimation . . . . . . . . . . . . . . . . . . 134
7.7.2 Count-min sketches . . . . . . . . . . . . . . . . . . . 136
7.7.2.1 Initialization and updates . . . . . . . . . . . 137
7.7.2.2 Queries . . . . . . . . . . . . . . . . . . . . . 137
7.7.2.3 Finding heavy hitters . . . . . . . . . . . . . 140

7.8 Locality-sensitive hashing . . . . . . . . . . . . . . . . . . . . 140


7.8.1 Approximate nearest neighbor search . . . . . . . . . . 140
7.8.2 Locality-sensitive hash functions . . . . . . . . . . . . 141
7.8.3 Constructing an (r1 , r2 )-PLEB . . . . . . . . . . . . . 142
7.8.4 Hash functions for Hamming distance . . . . . . . . . 143
7.8.5 Hash functions for ℓ1 distance . . . . . . . . . . . . 146

8 Dimension reduction 147


8.1 The Johnson-Lindenstrauss lemma . . . . . . . . . . . . . . . 147
8.1.1 Reduction to single-vector case . . . . . . . . . . . . . 148
8.1.2 A relatively simple proof of the lemma . . . . . . . . . 148
8.1.3 Distributional version . . . . . . . . . . . . . . . . . . 151
8.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

9 Martingales and stopping times 153


9.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.2 Submartingales and supermartingales . . . . . . . . . . . . . 154
9.3 The optional stopping theorem . . . . . . . . . . . . . . . . . 155
9.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.4.1 Random walks . . . . . . . . . . . . . . . . . . . . . . 158
9.4.2 Wald’s equation . . . . . . . . . . . . . . . . . . . . . 162
9.4.3 Maximal inequalities . . . . . . . . . . . . . . . . . . . 162
9.4.4 Waiting times for patterns . . . . . . . . . . . . . . . . 164

10 Markov chains 165


10.1 Basic definitions and properties . . . . . . . . . . . . . . . . . 166
10.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 168
10.2 Convergence of Markov chains . . . . . . . . . . . . . . . . . . 170
10.2.1 Stationary distributions . . . . . . . . . . . . . . . . . 171
10.2.2 Total variation distance . . . . . . . . . . . . . . . . . 171
10.2.2.1 Total variation distance and expectation . . 173
10.2.3 Mixing time . . . . . . . . . . . . . . . . . . . . . . . . 173
10.2.4 Coupling of Markov chains . . . . . . . . . . . . . . . 174
10.2.5 Irreducible and aperiodic chains . . . . . . . . . . . . 175
10.2.6 Convergence of finite irreducible aperiodic Markov chains . . 177
10.3 Reversible chains . . . . . . . . . . . . . . . . . . . . . . . . . 178
10.3.1 Stationary distributions . . . . . . . . . . . . . . . . . 178
10.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.3.3 Time-reversed chains . . . . . . . . . . . . . . . . . . . 180

10.3.4 Adjusting stationary distributions with the Metropolis-Hastings algorithm . . . . 182
10.4 The coupling method . . . . . . . . . . . . . . . . . . . . . . . 183
10.4.1 Random walk on a cycle . . . . . . . . . . . . . . . . . 183
10.4.2 Random walk on a hypercube . . . . . . . . . . . . . . 185
10.4.3 Various shuffling algorithms . . . . . . . . . . . . . . . 186
10.4.3.1 Move-to-top . . . . . . . . . . . . . . . . . . 186
10.4.3.2 Random exchange of arbitrary cards . . . . . 187
10.4.3.3 Random exchange of adjacent cards . . . . . 187
10.4.3.4 Real-world shuffling . . . . . . . . . . . . . . 189
10.4.4 Path coupling . . . . . . . . . . . . . . . . . . . . . . . 189
10.4.4.1 Random walk on a hypercube . . . . . . . . 190
10.4.4.2 Sampling graph colorings . . . . . . . . . . . 191
10.4.4.3 Sampling independent sets . . . . . . . . . . 193
10.4.4.4 Simulated annealing . . . . . . . . . . . . . . 196
Single peak . . . . . . . . . . . . . . . . . . . . 197
Somewhat smooth functions . . . . . . . . . . . 197
10.5 Spectral methods for reversible chains . . . . . . . . . . . . . 198
10.5.1 Spectral properties of a reversible chain . . . . . . . . 198
10.5.2 Analysis of symmetric chains . . . . . . . . . . . . . . 199
10.5.3 Analysis of asymmetric chains . . . . . . . . . . . . . . 202
10.6 Conductance . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.6.1 Easy cases for conductance . . . . . . . . . . . . . . . 204
10.6.2 Edge expansion using canonical paths . . . . . . . . . 204
10.6.3 Congestion . . . . . . . . . . . . . . . . . . . . . . . . 206
10.6.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 208
10.6.4.1 Lazy random walk on a line . . . . . . . . . . 208
10.6.4.2 Random walk on a hypercube . . . . . . . . 208
10.6.4.3 Matchings in a graph . . . . . . . . . . . . . 209
10.6.4.4 Perfect matchings in dense bipartite graphs . 211

11 Approximate counting 214


11.1 Exact counting . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.2 Counting by sampling . . . . . . . . . . . . . . . . . . . . . . 215
11.2.1 Generating samples . . . . . . . . . . . . . . . . . . . 216
11.3 Approximating #KNAPSACK . . . . . . . . . . . . . . . . . 217
11.4 Approximating #DNF . . . . . . . . . . . . . . . . . . . . . . 220
11.5 Approximating exponentially improbable events . . . . . . . . 221
11.5.1 Matchings . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.5.2 Other problems . . . . . . . . . . . . . . . . . . . . . . 223

12 Hitting times 224


12.1 Waiting times . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
12.2 Lyapunov functions . . . . . . . . . . . . . . . . . . . . . . . . 225
12.3 Drift analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

13 The probabilistic method 234


13.1 Randomized constructions and existence proofs . . . . . . . . 234
13.1.1 Set balancing . . . . . . . . . . . . . . . . . . . . . . . 235
13.1.2 Ramsey numbers . . . . . . . . . . . . . . . . . . . . . 235
13.2 Approximation algorithms . . . . . . . . . . . . . . . . . . . . 237
13.2.1 MAX CUT . . . . . . . . . . . . . . . . . . . . . . . . 238
13.2.2 MAX SAT . . . . . . . . . . . . . . . . . . . . . . . . 239
13.3 The Lovász Local Lemma . . . . . . . . . . . . . . . . . . . . 242
13.3.1 General version . . . . . . . . . . . . . . . . . . . . . . 243
13.3.2 Symmetric version . . . . . . . . . . . . . . . . . . . . 244
13.3.3 Applications . . . . . . . . . . . . . . . . . . . . . . . 244
13.3.3.1 Graph coloring . . . . . . . . . . . . . . . . . 244
13.3.3.2 Satisfiability of k-CNF formulas . . . . . . . 245
13.3.3.3 Hypergraph 2-colorability . . . . . . . . . . . 245
13.3.4 Non-constructive proof . . . . . . . . . . . . . . . . . . 246
13.3.5 Constructive proof . . . . . . . . . . . . . . . . . . . . 249

14 Derandomization 253
14.1 Deterministic vs. randomized algorithms . . . . . . . . . . . . 254
14.2 Adleman’s theorem . . . . . . . . . . . . . . . . . . . . . . . . 255
14.3 Limited independence . . . . . . . . . . . . . . . . . . . . . . 256
14.3.1 MAX CUT . . . . . . . . . . . . . . . . . . . . . . . . 256
14.4 The method of conditional probabilities . . . . . . . . . . . . 257
14.4.1 MAX CUT using conditional probabilities . . . . . . . 258
14.4.2 Deterministic construction of Ramsey graphs . . . . . 259
14.4.3 Derandomized set balancing . . . . . . . . . . . . . . . 259

15 Probabilistically-checkable proofs 261


15.1 Probabilistically-checkable proofs . . . . . . . . . . . . . . . . 262
15.2 A PCP for GRAPH NON-ISOMORPHISM . . . . . . . . . . 263
15.2.1 GRAPH NON-ISOMORPHISM with private coins . . 263
15.2.2 A probabilistically-checkable proof for GRAPH NON-ISOMORPHISM . . . . 264
15.3 NP ⊆ PCP(poly(n), 1) . . . . . . . . . . . . . . . . . . . . . 264
15.3.1 QUADEQ . . . . . . . . . . . . . . . . . . . . . . . . . 264

15.3.2 The Walsh-Hadamard Code . . . . . . . . . . . . . . . 265


15.3.3 A PCP for QUADEQ . . . . . . . . . . . . . . . . . . 266
15.4 PCP and approximability . . . . . . . . . . . . . . . . . . . . 267
15.4.1 Approximating the number of satisfied verifier queries 267
15.4.2 Gap-preserving reduction to MAX SAT . . . . . . . . 268
15.4.3 Other inapproximable problems . . . . . . . . . . . . . 269
15.5 Dinur’s proof of the PCP theorem . . . . . . . . . . . . . . . 270
15.6 The Unique Games Conjecture . . . . . . . . . . . . . . . . . 272

16 Quantum computing 274


16.1 Random circuits . . . . . . . . . . . . . . . . . . . . . . . . . 274
16.2 Bra-ket notation . . . . . . . . . . . . . . . . . . . . . . . . . 277
16.2.1 States as kets . . . . . . . . . . . . . . . . . . . . . . . 277
16.2.2 Composition of kets . . . . . . . . . . . . . . . . . . . 278
16.2.3 Operators as sums of kets times bras . . . . . . . . . . 278
16.3 Quantum circuits . . . . . . . . . . . . . . . . . . . . . . . . . 279
16.3.1 Quantum operations . . . . . . . . . . . . . . . . . . . 280
16.3.2 Quantum implementations of classical operations . . . 281
16.3.3 Phase representation of a function . . . . . . . . . . . 283
16.3.4 Practical issues (which we will ignore) . . . . . . . . . 284
16.3.5 Quantum computations . . . . . . . . . . . . . . . . . 284
16.4 Deutsch’s algorithm . . . . . . . . . . . . . . . . . . . . . . . 284
16.5 Grover’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . 285
16.5.1 Initial superposition . . . . . . . . . . . . . . . . . . . 286
16.5.2 The Grover diffusion operator . . . . . . . . . . . . . . 286
16.5.3 Effect of the iteration . . . . . . . . . . . . . . . . . . 287

17 Randomized distributed algorithms 289


17.1 Consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
17.1.1 Impossibility of deterministic algorithms . . . . . . . . 291
17.2 Leader election . . . . . . . . . . . . . . . . . . . . . . . . . . 292
17.3 How randomness helps . . . . . . . . . . . . . . . . . . . . . . 292
17.4 Building a weak shared coin . . . . . . . . . . . . . . . . . . . 293
17.5 Leader election with sifters . . . . . . . . . . . . . . . . . . . 295
17.6 Consensus with sifters . . . . . . . . . . . . . . . . . . . . . . 296

A Sample assignments from Spring 2024 299


A.1 Assignment 1, due Thursday 2024-02-01 at 23:59 . . . . . . . 299
A.1.1 Matchings . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.1.2 Non-volatile memory . . . . . . . . . . . . . . . . . . . 301

A.2 Assignment 2, due Thursday 2024-02-15 at 23:59 . . . . . . . 304


A.2.1 Mediocre cuts . . . . . . . . . . . . . . . . . . . . . . . 304
A.2.2 Training costs . . . . . . . . . . . . . . . . . . . . . . . 305
A.3 Assignment 3, due Thursday 2024-02-29 at 23:59 . . . . . . . 307
A.3.1 A robot rendezvous problem . . . . . . . . . . . . . . 307
A.3.2 A linked list . . . . . . . . . . . . . . . . . . . . . . . . 308
A.4 Assignment 4, due Thursday 2024-03-28 at 23:59 . . . . . . . 311
A.4.1 Nearly orthogonal vectors . . . . . . . . . . . . . . . . 311
A.4.2 Boosting a random walk . . . . . . . . . . . . . . . . . 312
A.5 Assignment 5, due Thursday 2024-04-11 at 23:59 . . . . . . . 315
A.5.1 Relaxation time for Metropolis-Hastings . . . . . . . . 315
A.5.2 A constrained random walk . . . . . . . . . . . . . . . 317
A.6 Assignment 6, due Thursday 2024-04-25 at 23:59 . . . . . . . 319
A.6.1 The power of intransigence . . . . . . . . . . . . . . . 319
Potential function argument . . . . . . . . . . . 321
Coupling argument . . . . . . . . . . . . . . . . 321
A.6.2 Almost Markov . . . . . . . . . . . . . . . . . . . . . . 322

B Sample assignments from Spring 2023 324


B.1 Assignment 1, due Thursday 2023-02-16 at 23:59 . . . . . . . 324
B.1.1 Hashing without counting . . . . . . . . . . . . . . . . 324
B.1.2 Permutation routing on an incomplete network . . . . 326
B.2 Assignment 2, due Thursday 2023-03-30 at 23:59 . . . . . . . 331
B.2.1 Some streaming data structures . . . . . . . . . . . . . 331
B.2.2 A dense network . . . . . . . . . . . . . . . . . . . . . 333
B.3 Assignment 3, due Thursday 2023-04-27 at 23:59 . . . . . . . 335
B.3.1 Shuffling a graph . . . . . . . . . . . . . . . . . . . . . 335
B.3.2 Counting unbalanced sets . . . . . . . . . . . . . . . . 337

C Sample assignments from Fall 2019 340


C.1 Assignment 1: due Thursday, 2019-09-12, at 23:00 . . . . . . 340
C.1.1 The golden ticket . . . . . . . . . . . . . . . . . . . . . 340
C.1.2 Exploding computers . . . . . . . . . . . . . . . . . . . 342
C.2 Assignment 2: due Thursday, 2019-09-26, at 23:00 . . . . . . 344
C.2.1 A logging problem . . . . . . . . . . . . . . . . . . . . 344
C.2.2 Return of the exploding computers . . . . . . . . . . . 345
C.3 Assignment 3: due Thursday, 2019-10-10, at 23:00 . . . . . . 346
C.3.1 Two data plans . . . . . . . . . . . . . . . . . . . . . . 346
C.3.2 A randomly-indexed list . . . . . . . . . . . . . . . . . 348
C.4 Assignment 4: due Thursday, 2019-10-31, at 23:00 . . . . . . 349

C.4.1 A hash tree . . . . . . . . . . . . . . . . . . . . . . . . 349


C.4.2 Randomized robot rendezvous on a ring . . . . . . . . 350
C.5 Assignment 5: due Thursday, 2019-11-14, at 23:00 . . . . . . 352
C.5.1 Non-exploding computers . . . . . . . . . . . . . . . . 352
C.5.2 A wordy walk . . . . . . . . . . . . . . . . . . . . . . . 354
C.6 Assignment 6: due Monday, 2019-12-09, at 23:00 . . . . . . . 356
C.6.1 Randomized colorings . . . . . . . . . . . . . . . . . . 356
C.6.2 No long paths . . . . . . . . . . . . . . . . . . . . . . . 357
C.7 Final exam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
C.7.1 A uniform ring . . . . . . . . . . . . . . . . . . . . . . 359
C.7.2 Forbidden runs . . . . . . . . . . . . . . . . . . . . . . 361
C.7.3 A derandomized balancing scheme . . . . . . . . . . . 362

D Sample assignments from Fall 2016 364


D.1 Assignment 1: due Sunday, 2016-09-18, at 17:00 . . . . . . . . 364
D.1.1 Bubble sort . . . . . . . . . . . . . . . . . . . . . . . . 364
D.1.2 Finding seats . . . . . . . . . . . . . . . . . . . . . . . 366
D.2 Assignment 2: due Thursday, 2016-09-29, at 23:00 . . . . . . 368
D.2.1 Technical analysis . . . . . . . . . . . . . . . . . . . . 368
D.2.2 Faulty comparisons . . . . . . . . . . . . . . . . . . . . 372
D.3 Assignment 3: due Thursday, 2016-10-13, at 23:00 . . . . . . 373
D.3.1 Painting with sprites . . . . . . . . . . . . . . . . . . . 373
D.3.2 Dynamic load balancing . . . . . . . . . . . . . . . . . 375
D.4 Assignment 4: due Thursday, 2016-11-03, at 23:00 . . . . . . 376
D.4.1 Re-rolling a random treap . . . . . . . . . . . . . . . . 376
D.4.2 A defective hash table . . . . . . . . . . . . . . . . . . 379
D.5 Assignment 5: due Thursday, 2016-11-17, at 23:00 . . . . . . 381
D.5.1 A spectre is haunting Halloween . . . . . . . . . . . . 381
D.5.2 Colliding robots on a line . . . . . . . . . . . . . . . . 382
D.6 Assignment 6: due Thursday, 2016-12-08, at 23:00 . . . . . . 386
D.6.1 Another colliding robot . . . . . . . . . . . . . . . . . 386
D.6.2 Return of the sprites . . . . . . . . . . . . . . . . . . . 389
D.7 Final exam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
D.7.1 Virus eradication (20 points) . . . . . . . . . . . . . . 391
D.7.2 Parallel bubblesort (20 points) . . . . . . . . . . . . . 392
D.7.3 Rolling a die (20 points) . . . . . . . . . . . . . . . . . 393

E Sample assignments from Spring 2014 395


E.1 Assignment 1: due Wednesday, 2014-09-10, at 17:00 . . . . . 395
E.1.1 Bureaucratic part . . . . . . . . . . . . . . . . . . . . . 395
E.1.2 Two terrible data structures . . . . . . . . . . . . . . . 395
E.1.3 Parallel graph coloring . . . . . . . . . . . . . . . . . . 399
E.2 Assignment 2: due Wednesday, 2014-09-24, at 17:00 . . . . . 401
E.2.1 Load balancing . . . . . . . . . . . . . . . . . . . . . . 401
E.2.2 A missing hash function . . . . . . . . . . . . . . . . . 401
E.3 Assignment 3: due Wednesday, 2014-10-08, at 17:00 . . . . . 402
E.3.1 Tree contraction . . . . . . . . . . . . . . . . . . . . . 402
E.3.2 Part testing . . . . . . . . . . . . . . . . . . . . . . . . 405
Using McDiarmid’s inequality and some cleverness . . . . 406
E.4 Assignment 4: due Wednesday, 2014-10-29, at 17:00 . . . . . 407
E.4.1 A doubling strategy . . . . . . . . . . . . . . . . . . . 407
E.4.2 Hash treaps . . . . . . . . . . . . . . . . . . . . . . . . 408
E.5 Assignment 5: due Wednesday, 2014-11-12, at 17:00 . . . . . 409
E.5.1 Agreement on a ring . . . . . . . . . . . . . . . . . . . 409
E.5.2 Shuffling a two-dimensional array . . . . . . . . . . . . 412
E.6 Assignment 6: due Wednesday, 2014-12-03, at 17:00 . . . . . 413
E.6.1 Sampling colorings on a cycle . . . . . . . . . . . . . . 413
E.6.2 A hedging problem . . . . . . . . . . . . . . . . . . . . 414
E.7 Final exam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
E.7.1 Double records (20 points) . . . . . . . . . . . . . . . 416
E.7.2 Hipster graphs (20 points) . . . . . . . . . . . . . . . . 417
Using the method of conditional probabilities . 417
Using hill climbing . . . . . . . . . . . . . . . . 418
E.7.3 Storage allocation (20 points) . . . . . . . . . . . . . . 418
E.7.4 Fault detectors in a grid (20 points) . . . . . . . . . . 419

F Sample assignments from Spring 2013 422


F.1 Assignment 1: due Wednesday, 2013-01-30, at 17:00 . . . . . 422
F.1.1 Bureaucratic part . . . . . . . . . . . . . . . . . . . . . 422
F.1.2 Balls in bins . . . . . . . . . . . . . . . . . . . . . . . . 422
F.1.3 A labeled graph . . . . . . . . . . . . . . . . . . . . . 423
F.1.4 Negative progress . . . . . . . . . . . . . . . . . . . . . 423
F.2 Assignment 2: due Thursday, 2013-02-14, at 17:00 . . . . . . 425
F.2.1 A local load-balancing algorithm . . . . . . . . . . . . 425
F.2.2 An assignment problem . . . . . . . . . . . . . . . . . 427
F.2.3 Detecting excessive collusion . . . . . . . . . . . . . . 427
F.3 Assignment 3: due Wednesday, 2013-02-27, at 17:00 . . . . . 429

F.3.1 Going bowling . . . . . . . . . . . . . . . . . . . . . . 429


F.3.2 Unbalanced treaps . . . . . . . . . . . . . . . . . . . . 430
F.3.3 Random radix trees . . . . . . . . . . . . . . . . . . . 432
F.4 Assignment 4: due Wednesday, 2013-03-27, at 17:00 . . . . . 433
F.4.1 Flajolet-Martin sketches with deletion . . . . . . . . . 433
F.4.2 An adaptive hash table . . . . . . . . . . . . . . . . . 435
F.4.3 An odd locality-sensitive hash function . . . . . . . . . 437
F.5 Assignment 5: due Friday, 2013-04-12, at 17:00 . . . . . . . . 438
F.5.1 Choosing a random direction . . . . . . . . . . . . . . 438
F.5.2 Random walk on a tree . . . . . . . . . . . . . . . . . 439
F.5.3 Sampling from a tree . . . . . . . . . . . . . . . . . . . 440
F.6 Assignment 6: due Friday, 2013-04-26, at 17:00 . . . . . . . . 441
F.6.1 Increasing subsequences . . . . . . . . . . . . . . . . . 441
F.6.2 Futile word searches . . . . . . . . . . . . . . . . . . . 442
F.6.3 Balance of power . . . . . . . . . . . . . . . . . . . . . 444
F.7 Final exam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
F.7.1 Dominating sets . . . . . . . . . . . . . . . . . . . . . 445
F.7.2 Tricolor triangles . . . . . . . . . . . . . . . . . . . . . 446
F.7.3 The n rooks problem . . . . . . . . . . . . . . . . . . . 447
F.7.4 Pursuing an invisible target on a ring . . . . . . . . . 447

G Sample assignments from Spring 2011 449


G.1 Assignment 1: due Wednesday, 2011-01-26, at 17:00 . . . . . 449
G.1.1 Bureaucratic part . . . . . . . . . . . . . . . . . . . . . 449
G.1.2 Rolling a die . . . . . . . . . . . . . . . . . . . . . . . 449
G.1.3 Rolling many dice . . . . . . . . . . . . . . . . . . . . 451
G.1.4 All must have candy . . . . . . . . . . . . . . . . . . . 451
G.2 Assignment 2: due Wednesday, 2011-02-09, at 17:00 . . . . . 452
G.2.1 Randomized dominating set . . . . . . . . . . . . . . . 452
G.2.2 Chernoff bounds with variable probabilities . . . . . . 454
G.2.3 Long runs . . . . . . . . . . . . . . . . . . . . . . . . . 455
G.3 Assignment 3: due Wednesday, 2011-02-23, at 17:00 . . . . . 457
G.3.1 Longest common subsequence . . . . . . . . . . . . . . 457
G.3.2 A strange error-correcting code . . . . . . . . . . . . . 459
G.3.3 A multiway cut . . . . . . . . . . . . . . . . . . . . . . 460
G.4 Assignment 4: due Wednesday, 2011-03-23, at 17:00 . . . . . 461
G.4.1 Sometimes successful betting strategies are possible . 461
G.4.2 Random walk with reset . . . . . . . . . . . . . . . . . 463
G.4.3 Yet another shuffling algorithm . . . . . . . . . . . . . 465
G.5 Assignment 5: due Thursday, 2011-04-07, at 23:59 . . . . . . 466

G.5.1 A reversible chain . . . . . . . . . . . . . . . . . . . . 466


G.5.2 Toggling bits . . . . . . . . . . . . . . . . . . . . . . . 467
G.5.3 Spanning trees . . . . . . . . . . . . . . . . . . . . . . 469
G.6 Assignment 6: due Monday, 2011-04-25, at 17:00 . . . . . . . 470
G.6.1 Sparse satisfying assignments to DNFs . . . . . . . . . 470
G.6.2 Detecting duplicates . . . . . . . . . . . . . . . . . . . 471
G.6.3 Balanced Bloom filters . . . . . . . . . . . . . . . . . . 472
G.7 Final exam . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
G.7.1 Leader election . . . . . . . . . . . . . . . . . . . . . . 475
G.7.2 Two-coloring an even cycle . . . . . . . . . . . . . . . 476
G.7.3 Finding the maximum . . . . . . . . . . . . . . . . . . 477
G.7.4 Random graph coloring . . . . . . . . . . . . . . . . . 478

H Sample assignments from Spring 2009 479


H.1 Final exam, Spring 2009 . . . . . . . . . . . . . . . . . . . . . 479
H.1.1 Randomized mergesort (20 points) . . . . . . . . . . . 479
H.1.2 A search problem (20 points) . . . . . . . . . . . . . . 480
H.1.3 Support your local police (20 points) . . . . . . . . . . 481
H.1.4 Overloaded machines (20 points) . . . . . . . . . . . . 482

I Probabilistic recurrences 483


I.1 Recurrences with constant cost functions . . . . . . . . . . . . 483
I.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
I.3 The Karp-Upfal-Wigderson bound . . . . . . . . . . . . . . . 484
I.3.1 Waiting for heads . . . . . . . . . . . . . . . . . . . . . 486
I.3.2 Quickselect . . . . . . . . . . . . . . . . . . . . . . . . 486
I.3.3 Tossing coins . . . . . . . . . . . . . . . . . . . . . . . 486
I.3.4 Coupon collector . . . . . . . . . . . . . . . . . . . . . 487
I.3.5 Chutes and ladders . . . . . . . . . . . . . . . . . . . . 487
I.4 High-probability bounds . . . . . . . . . . . . . . . . . . . . . 488
I.4.1 High-probability bounds from expectation bounds . . 489
I.4.2 Detailed analysis of the recurrence . . . . . . . . . . . 489
I.5 More general recurrences . . . . . . . . . . . . . . . . . . . . . 490

Bibliography 491

Index 510
List of Figures

2.1 Karger’s min-cut algorithm . . . . . . . . . . . . . . . . . . . 26

4.1 Proof of Jensen’s inequality . . . . . . . . . . . . . . . . . . . 56

5.1 Comparison of Chernoff bound variants . . . . . . . . . . . . 72


5.2 Hypercube network with n = 3 . . . . . . . . . . . . . . . . . 77

6.1 Tree rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . 103


6.2 Balanced and unbalanced binary search trees . . . . . . . . . 104
6.3 Binary search tree after inserting 5 1 7 3 4 6 2 . . . . . . . . 105
6.4 Inserting values into a treap . . . . . . . . . . . . . . . . . . . 106
6.5 Tree rotation shortens spines . . . . . . . . . . . . . . . . . . 110
6.6 Skip list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

10.1 Drawing a Markov chain as a digraph . . . . . . . . . . . . . 167

13.1 Tricky step in MAX SAT argument . . . . . . . . . . . . . . . 242

A.1 Non-volatile memory in action . . . . . . . . . . . . . . . . . 302

B.1 An incomplete network with n = 2d = 4 . . . . . . . . . . . . 327


B.2 Graph density evolution . . . . . . . . . . . . . . . . . . . . . 333

D.1 Filling a screen with Space Invaders . . . . . . . . . . . . . . 374


D.2 Hidden Space Invaders . . . . . . . . . . . . . . . . . . . . . . 390

E.1 Example of tree contraction for Problem E.3.1 . . . . . . . . 403

F.1 Radix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432


F.2 Word searches . . . . . . . . . . . . . . . . . . . . . . . . . . . 443

List of Tables

3.1 Sum of two dice . . . . . . . . . . . . . . . . . . . . . . . . . . 30


3.2 Various conditional expectations on two independent dice . . 46

5.1 Concentration bounds . . . . . . . . . . . . . . . . . . . . . . 60

7.1 Hash table parameters . . . . . . . . . . . . . . . . . . . . . . 116

List of Algorithms

7.1 Insertion procedure for cuckoo hashing . . . . . . . . . . . . . 124

17.1 Attiya-Censor weak shared coin [AC08] . . . . . . . . . . . . . 294


17.2 Giakkoupis-Woelfel sifter [GW12] . . . . . . . . . . . . . . . . 295
17.3 Sifter using max registers [AE19] . . . . . . . . . . . . . . . . . 298

D.1 One pass of bubble sort . . . . . . . . . . . . . . . . . . . . . . 365

F.1 Adaptive hash table insertion . . . . . . . . . . . . . . . . . . . 436

G.1 Dubious duplicate detector . . . . . . . . . . . . . . . . . . . . 471


G.2 Randomized max-finding algorithm . . . . . . . . . . . . . . . 477

Preface

These are notes for the Yale course CPSC 469/569 Randomized Algorithms.
This document also incorporates the lecture schedule and assignments, as
well as some sample assignments from previous semesters. Because this
is a work in progress, it will be updated frequently over the course of the
semester.
Much of the structure of the course follows Mitzenmacher and Upfal’s
Probability and Computing [MU17], with some material from Motwani and
Raghavan’s Randomized Algorithms [MR95]. In most cases you’ll find these
textbooks contain much more detail than what is presented here, so it is
probably better to consider this document a supplement to them than to
treat it as your primary source of information.
The most recent version of these notes will be available at
https://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. More stable
archival versions may be found at https://arxiv.org/abs/2003.01902.
I would like to thank my many students and teaching fellows over the
years for their help in pointing out errors and omissions in earlier drafts of
these notes.

Chapter 1

Randomized algorithms

A randomized algorithm flips coins during its execution to determine what


to do next. When considering a randomized algorithm, we usually care
about its expected worst-case performance, which is the average amount
of time it takes on the worst input of a given size. This average is computed
over all the possible outcomes of the coin flips during the execution of the
algorithm. We may also ask for a high-probability bound, showing that
the algorithm doesn’t consume too many resources most of the time.
In studying randomized algorithms, we consider pretty much the same
issues as for deterministic algorithms: how to design a good randomized
algorithm, and how to prove that it works within given time or error bounds.
The main difference is that it is often easier to design a randomized algo-
rithm—randomness turns out to be a good substitute for cleverness more
often than one might expect—but harder to analyze it. So much of what
one does is develop good techniques for analyzing the often very complex
random processes that arise in the execution of an algorithm. Fortunately,
in doing so we can often use techniques already developed by probabilists
and statisticians for analyzing less overtly algorithmic processes.
Formally, we think of a randomized algorithm as a machine M that
computes M (x, r), where x is the problem input and r is the sequence of
random bits. Our machine model is the usual random-access machine or
RAM model, where we have a memory space that is typically polynomial in
the size of the input n, and in constant time we can read a memory location,
write a memory location, or perform arithmetic operations on integers of
up to O(log n) bits.1 In this model, we may find it easier to think of the
random bits as supplied as needed by some subroutine, where generating
a random integer of size O(log n) takes constant time; the justification for
this assumption is that it takes constant time to read the next O(log n)-sized
value from the random input.

1
This model is unrealistic in several ways: the assumption that we can perform arithmetic
on O(log n)-bit quantities in constant time omits at least a factor of Ω(log log n) for addition
and probably more for multiplication in any realistic implementation, while the assumption
that we can address n^c distinct locations in memory in anything less than n^{c/3} time in the
worst case requires exceeding the speed of light. But for reasonably small n, this gives a
pretty good approximation of the performance of real computers, which do in fact perform
arithmetic and access memory in a fixed amount of time, although with fixed bounds on
the size of both arithmetic operands and memory.
Because the number of these various constant-time operations, and thus
the running time for the algorithm as a whole, may depend on the random
bits, it is now a random variable—a function on points in some probability
space. The probability space Ω consists of all possible sequences r, each
of which is assigned a probability Pr [r] (typically 2^{−|r|}), and the running
time for M on some input x is generally given as an expected value2
Er [time(M (x, r))], where for any X,

    Er [X] = ∑_{r∈Ω} X(r) Pr [r] .        (1.0.1)

2
We’ll see more details of these and other concepts from probability theory in Chapters 2
and 3.

We can now quote the performance of M in terms of this expected value:


where we would say that a deterministic algorithm runs in time O(f (n)),
where n = |x| is the size of the input, we instead say that our randomized algo-
rithm runs in expected time O(f (n)), which means that Er [time(M (x, r))] =
O(f (|x|)) for all inputs x.
This is distinct from traditional worst-case analysis, where there is no
r and no expectation, and average-case analysis, where there is again no
r and the value reported is not a maximum but an expectation over some
distribution on x. The following trivial example shows the distinction.

1.1 Searching an array


Suppose we have an array A with n elements, and we are told that one of
the positions in the array contains a nonzero value (the “prize”) while every
other position contains zero. How quickly can we find the prize in the worst
case?
Any deterministic algorithm will probe the array positions in some
predictable order. An adversary can simulate the algorithm running in an

execution where it always sees 0 until it has checked every location. By


putting the prize in the last place the algorithm looks, we get a worst-case
input that requires at least n probes to find it.
In the average case, we might assume that instead of placing the prize in
the worst possible position, it is placed uniformly at random. Now if we just
scan the array from left to right, the expected number of probes will be
    ∑_{i=1}^{n} i · Pr [algorithm does i probes] = ∑_{i=1}^{n} i · (1/n)
                                                 = (1/n) · n(n + 1)/2
                                                 = (n + 1)/2.
This is still Θ(n), but we’ve improved the constant factor by almost a factor
of 2. The cost is that we trust the adversary to place the prize randomly—but
there is no reason for the adversary to do this.
The trick to randomized algorithms is that we can obtain the same
expected payoff even in the worst case by supplying the randomness ourselves.
Let’s suppose that instead of scanning the array predictably from left to
right, we flip a coin and either scan left-to-right or scan right-to-left with
probability 1/2 each. If the prize is in position i, then scanning left-to-right
finds it in i probes, while scanning right-to-left finds it in n − i + 1 probes.
The expected cost is thus
    (1/2) · i + (1/2) · (n − i + 1) = (n + 1)/2,
the same as in the average case. But now we don’t make any assumptions
about the input—by using just one bit of randomness we get this expected
time no matter where the adversary puts the prize.
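To make this concrete, here is a small Python sketch of the one-coin-flip scan,
together with a quick simulation of its expected cost. This is my own
illustration, not code from the notes; the function name find_prize and the
simulation parameters are made up.

import random

def find_prize(A):
    # Scan left-to-right or right-to-left with probability 1/2 each;
    # return the number of probes used to find the nonzero entry.
    n = len(A)
    order = range(n) if random.random() < 0.5 else range(n - 1, -1, -1)
    probes = 0
    for i in order:
        probes += 1
        if A[i] != 0:
            return probes
    raise ValueError("no prize in array")

# The adversary puts the prize wherever it likes; the expected cost is
# (n + 1)/2 regardless of the position it chooses.
n, trials = 100, 100000
A = [0] * n
A[n - 1] = 1
avg = sum(find_prize(A) for _ in range(trials)) / trials
print(avg, (n + 1) / 2)   # both should be close to 50.5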
A natural question is whether a more clever algorithm can do even better.
Usually it’s pretty hard to prove lower bounds on algorithms, but in this
case we can use a classic result called Yao’s lemma [Yao77] to show a
worst-case lower bound on the expected cost of any randomized algorithm
by constructing a probability distribution on inputs that gives a bad average-
case lower bound for any deterministic algorithm. We’ll give a proof of this
in §3.6.1, but the intuition is that any randomized algorithm acts like picking
a deterministic algorithm at random, so if there is an input distribution that
is bad for all deterministic algorithms, it is still bad even if we pick one of
those algorithms randomly.

In this case, the bad input distribution is simple: put the 1 in each
position A[i] with equal probability 1/n. For a deterministic algorithm, there
will be some fixed sequence of positions i1 , i2 , . . . that it examines as long as
it only sees zeros. A smart deterministic algorithm will not examine the same
position twice, so the 1 is equally likely to be found after 1, 2, 3, . . . , n probes.
This gives the same expected (n + 1)/2 probes as for the simple randomized
algorithm, which shows that that algorithm is optimal.
We’ve been talking about searching an array, because that fits best in
our model of an input supplied to the algorithm, but essentially the same
analysis applies to brute-force inverting a black-box function. Here we have
a function f and target output y, and we want to find an input x such that
f (x) = y. The same analysis as for the array case shows that this takes (n + 1)/2
expected evaluations of f assuming that exactly one x works and we can’t
do anything clever.

Curiously, in this case it may be possible to improve this bound to O(√n)
evaluations if somehow we get our hands on a working quantum computer.
We’ll come back to this when we discuss quantum computing in general and
Grover’s algorithm in particular in Chapter 16.

1.2 Verifying polynomial identities


This classic problem is described in [MU17, §1.1]. Here we are given two
products of polynomials and we want to determine if they compute the same
function. For example, we might have

    p(x) = (x − 7)(x − 3)(x − 1)(x + 2)(2x + 5)

    q(x) = 2x^5 − 13x^4 − 21x^3 + 127x^2 + 121x − 210

These expressions both represent degree-5 polynomials, and it is not


obvious without multiplying out the factors of p whether they are equal or
not. Multiplying out all the factors of p may take as much as O(d^2) time if we
assume integer multiplication takes unit time and do it the straightforward
way.3 We can do better than this using randomization.
The trick is that evaluating p(x) and q(x) takes only O(d) integer opera-
tions, and we will find p(x) = q(x) only if either (a) p(x) and q(x) are really
the same polynomial, or (b) x is a root of p(x) − q(x). Since p(x) − q(x)
has degree at most d, it can’t have more than d roots. So if we choose x
uniformly at random from some much larger space, it’s likely that we will not
3
It can be faster if we do something sneaky like use fast Fourier transforms [SS71].

get a root. Indeed, evaluating p(11) = 112320 and q(11) = 120306 quickly
shows that p and q are not in fact the same.
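A minimal Python sketch of this test (my own, not code from the notes):
represent p by its linear factors and q by its coefficients, pick a random x
from a range r much larger than the degree d, and compare the two evaluations;
the error probability per trial is at most d/r.

import random

def eval_factored(factors, x):
    # factors is a list of (a, b) pairs representing the linear factor a*x + b
    result = 1
    for a, b in factors:
        result *= a * x + b
    return result

def eval_coeffs(coeffs, x):
    # coeffs[i] is the coefficient of x**i; evaluate by Horner's rule
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

p = [(1, -7), (1, -3), (1, -1), (1, 2), (2, 5)]   # (x-7)(x-3)(x-1)(x+2)(2x+5)
q = [-210, 121, 127, -21, -13, 2]                 # 2x^5 - 13x^4 - 21x^3 + 127x^2 + 121x - 210
d, r = 5, 10**6                                   # degree and range size
x = random.randrange(1, r + 1)
print("probably equal" if eval_factored(p, x) == eval_coeffs(q, x)
      else "definitely different")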
This is an example of a Monte Carlo algorithm, which is an algorithm
that runs in a fixed amount of time but only gives the right answer some of
the time. (In this case, with probability 1 − d/r, where r is the size of the
range of random integers we choose x from.) Monte Carlo algorithms have
the unnerving property of not indicating when their results are incorrect,
but we can make the probability of error as small as we like by running
the algorithm repeatedly. For this particular algorithm, the probability of
error after k trials is only (d/r)^k, which means that for fixed d/r we need
O(log(1/ε)) iterations to get the error bound down to any given ε. If we are
really paranoid, we could get the error down to 0 by testing d + 1 distinct
values, but now the cost is as high as multiplying out p again.
The error for this algorithm is one-sided: if we find a witness to the fact
that p ≠ q, we are done, but if we don’t, then all we know is that we haven’t
found a witness yet. We also have the property that if we check enough
possible witnesses, we are guaranteed to find one.
A similar property holds in the classic Miller-Rabin primality test,
a randomized algorithm for determining whether a large integer is prime
or not.4 The original version, due to Gary Miller [Mil76], showed that, as
in polynomial identity testing, it might be sufficient to pick a particular
set of deterministic candidate witnesses. Unfortunately, this result depends
on the truth of the extended Riemann hypothesis, a notoriously difficult
open problem in number theory. Michael Rabin [Rab80] demonstrated that
choosing random witnesses was enough, if we were willing to accept a small
probability of incorrectly identifying a composite number as prime.
For many years it was open whether it was possible to test primality
deterministically in polynomial time without unproven number-theoretic
assumptions, and the randomized Miller-Rabin algorithm was one of the
most widely-used randomized algorithms for which no good deterministic
alternative was known. Eventually, Agrawal et al. [AKS04] demonstrated
how to test primality deterministically using a different technique, although
the cost of their algorithm is high enough that Miller-Rabin is still used in
practice.
4
We will not describe this algorithm here.

1.3 Randomized QuickSort


The QuickSort algorithm [Hoa61a] works as follows. For simplicity, we
assume that no two elements of the array being sorted are equal.

• If the array has > 1 elements,


– Pick a pivot p uniformly at random from the elements of the
array.
– Split the array into A1 and A2 , where A1 contains all elements
< p and A2 contains all elements > p.
– Sort A1 and A2 recursively and return the sequence A1 , p, A2 .
• Otherwise return the array.
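A direct (if not especially memory-efficient) Python rendering of this procedure
might look like the following; this is my own sketch, not code from the notes.

import random

def quicksort(A):
    # Randomized QuickSort; assumes all elements of A are distinct.
    if len(A) <= 1:
        return list(A)
    p = random.choice(A)            # pivot chosen uniformly at random
    A1 = [x for x in A if x < p]    # note: the two comprehensions make two passes;
    A2 = [x for x in A if x > p]    # a one-pass split uses exactly n - 1 comparisons
    return quicksort(A1) + [p] + quicksort(A2)

print(quicksort([5, 1, 7, 3, 4, 6, 2]))   # [1, 2, 3, 4, 5, 6, 7]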

The splitting step takes exactly n − 1 comparisons, since we have to check


each non-pivot against the pivot. We assume all other costs are dominated by
the cost of comparisons. How many comparisons does randomized QuickSort
do on average?
There are two ways to solve this problem: the dumb way and the smart
way. We’ll do it the dumb way now and save the smart way for §1.3.2.

1.3.1 Brute force method: solve the recurrence


Let T (n) be the expected number of comparisons done on an array of n
elements. We have T (0) = T (1) = 0 and for larger n,

    T (n) = (n − 1) + (1/n) ∑_{k=0}^{n−1} (T (k) + T (n − 1 − k)).        (1.3.1)

Why? Because we do (n − 1) comparisons to split the piles, there are n
equally-likely choices for our pivot (hence the 1/n), and for each choice the
expected cost of the recursive sorts is T (k) + T (n − 1 − k), where k is the
number of elements that land in A1 . Formally, we are using here the law of
total probability, which says that for any random variable X and partition
of the probability space into events B1 . . . Bn ,

    E [X] = ∑_i Pr [Bi ] E [X | Bi ] ,

where

    E [X | Bi ] = ∑_{ω∈Bi} X(ω) Pr [ω] / Pr [Bi ]

is the conditional expectation of X conditioned on Bi , which we can


think of as just the average value of X if we know that Bi occurred. (See
§2.3.2 for more details.)
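As a toy sanity check of the law of total probability (my own example, using a
single die roll rather than QuickSort), the following computes E [X] directly and
via the partition into even and odd outcomes.

from fractions import Fraction

# Uniform die roll; partition the outcomes into "even" and "odd".
omega = [1, 2, 3, 4, 5, 6]
pr = {w: Fraction(1, 6) for w in omega}
X = {w: w for w in omega}                  # X = value rolled
events = {"even": [2, 4, 6], "odd": [1, 3, 5]}

EX = sum(X[w] * pr[w] for w in omega)
total = sum(
    sum(pr[w] for w in B)                                        # Pr[B_i]
    * (sum(X[w] * pr[w] for w in B) / sum(pr[w] for w in B))     # E[X | B_i]
    for B in events.values()
)
print(EX, total)    # both 7/2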
So now we just have to solve this ugly recurrence. We can reasonably
guess that when n ≥ 1, T (n) ≤ an log n for some constant a. Clearly this
holds for n = 1. Now apply induction on larger n to get

    T (n) = (n − 1) + (1/n) ∑_{k=0}^{n−1} (T (k) + T (n − 1 − k))
          = (n − 1) + (2/n) ∑_{k=0}^{n−1} T (k)
          = (n − 1) + (2/n) ∑_{k=1}^{n−1} T (k)
          ≤ (n − 1) + (2/n) ∑_{k=1}^{n−1} ak log k
          ≤ (n − 1) + (2/n) ∫_1^n ak log k dk
          = (n − 1) + (2a/n) · (n^2 log n / 2 − n^2 / 4 + 1/4)
          = (n − 1) + an log n − an/2 + a/(2n).
If we squint carefully at this recurrence for a while we notice that setting
a = 2 makes this less than or equal to an log n, since the remaining terms
become (n − 1) − n + 1/n = 1/n − 1, which is negative for n ≥ 1. We can
thus confidently conclude that T (n) ≤ 2n log n (for n ≥ 1).
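If you don’t trust the algebra, the recurrence is cheap to check numerically; the
following sketch (mine, not the notes’) tabulates T (n) directly from (1.3.1) and
compares it against 2n ln n. The cutoff N is arbitrary.

import math

N = 1000
T = [0.0, 0.0]                       # T(0) = T(1) = 0
for n in range(2, N + 1):
    # T(n) = (n-1) + (1/n) * sum_{k=0}^{n-1} (T(k) + T(n-1-k))
    #      = (n-1) + (2/n) * sum_{k=0}^{n-1} T(k)
    T.append((n - 1) + 2.0 / n * sum(T))

for n in (10, 100, 1000):
    print(n, T[n], 2 * n * math.log(n))   # T(n) stays below 2n ln n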

1.3.2 Clever method: use linearity of expectation


Alternatively, we can use linearity of expectation (which we’ll discuss
further in §3.4.1) to compute the expected number of comparisons used by
randomized QuickSort.
Imagine we use the following method for choosing pivots: we generate a
random permutation of all the elements in the array, and when asked to sort
some subarray A′, we use as pivot the first element of A′ that appears in our
list. Since each element is equally likely to be first, this is equivalent to the
actual algorithm. Pretend that we are always sorting the numbers 1 . . . n and
define for each pair of elements i < j the indicator variable Xij to be 1 if

i is compared to j at some point during the execution of the algorithm and 0


otherwise. Amazingly, we can actually compute the probability of this event
(and thus E [Xij ]): the only time i and j are compared is if one of them is
chosen as a pivot before they are split up into different arrays. How do they
get split up into different arrays? This happens if some intermediate element
k is chosen as pivot first, that is, if some k with i < k < j appears in the
permutation before both i and j. Occurrences of other elements don’t affect
the outcome, so we can concentrate on the restriction of the permutations
to just the numbers i through j, and we win if this restricted permutation
starts with either i or j. This event occurs with probability 2/(j − i + 1), so
we have E [Xij ] = 2/(j − i + 1). Summing over all pairs i < j gives:

    E [∑_{i<j} Xij ] = ∑_{i<j} E [Xij ]
                     = ∑_{i<j} 2/(j − i + 1)
                     = ∑_{i=1}^{n−1} ∑_{k=2}^{n−i+1} 2/k
                     = ∑_{i=2}^{n} ∑_{k=2}^{i} 2/k
                     = ∑_{k=2}^{n} 2(n − k + 1)/k
                     = ∑_{k=2}^{n} (2(n + 1)/k − 2)
                     = ∑_{k=2}^{n} 2(n + 1)/k − 2(n − 1)
                     = 2(n + 1)(Hn − 1) − 2(n − 1)
                     = 2(n + 1)Hn − 4n.

Here Hn = ∑_{i=1}^{n} 1/i is the n-th harmonic number, equal to ln n + γ +
O(n^{−1}), where γ ≈ 0.5772 is the Euler-Mascheroni constant (whose exact
value is unknown!). For asymptotic purposes we only need Hn = Θ(log n).
For the first step we are taking advantage of the fact that linearity of
expectation doesn’t care about the variables not being independent. The
rest is just algebra.
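The exact formula is also easy to sanity-check empirically: instrument QuickSort
to count comparisons and average over many runs. This is my own sketch (the
trial counts are arbitrary), not code from the notes.

import random

def quicksort_count(A):
    # Return the number of comparisons used by randomized QuickSort on A.
    if len(A) <= 1:
        return 0
    p = random.choice(A)
    A1 = [x for x in A if x < p]
    A2 = [x for x in A if x > p]
    return (len(A) - 1) + quicksort_count(A1) + quicksort_count(A2)

n, trials = 100, 2000
H = sum(1.0 / i for i in range(1, n + 1))     # harmonic number H_n
exact = 2 * (n + 1) * H - 4 * n
avg = sum(quicksort_count(list(range(n))) for _ in range(trials)) / trials
print(avg, exact)                             # the two numbers should be close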
This is pretty close to the bound of 2n log n we computed using the

recurrence in §1.3.1. Given that we now know the exact answer, we could in
principle go back and use it to solve the recurrence exactly.5
Which way is better? Solving the recurrence requires less probabilistic
handwaving (a more polite term might be “insight”) but more grinding
out inequalities, which is a pretty common trade-off. Since I am personally
not very clever I would try the brute-force approach first. But it’s worth
knowing about better methods so you can try them in other situations.

1.4 Where does the randomness come from?


Typically we assume in analyzing a randomized algorithm that we have
access to genuinely random bits r, and don’t ask too carefully how we can
get these random bits. For practical applications, there are basically three
choices, from strongest (and most expensive) to weakest (and cheapest):

1. Physical randomness. Lock a cat in a box with a radioactive source


and a Geiger counter attached to a solenoid aimed at a vial of prussic
acid [Sch35]. Come back in an hour and check if the cat is still breathing.
With an appropriately-tuned radioactive source, this generates one
very unpredictable bit of randomness at the cost of one half of a cat
on average.6
There are cheaper variants, mostly involving amplified quantum noise
at the high end, or monitoring physical processes that are expected
to be somewhat random (like intervals between keyboard presses or
seek times of hard drive heads) at the low end. In each case you get a
sequence of random bits that, under plausible physical assumptions,
are effectively unpredictable.
The /dev/random device in Linux systems gives you access to the cheap
kind of physical randomness, and will block waiting for you to wave
your mouse around if it runs out.

2. Cryptographically-secure pseudorandomness. Find some func-


tion that spits out a sequence of random-looking values given a seed,
such that if the seed is chosen uniformly at random (say using a physical
random number generator), no polynomial-time program can distin-
guish the sequence from an actual random sequence with non-trivial
probability. Usually expensive, but if your polynomial-time random-
ized algorithm fails using a CPRNG, you’ve succeeded in breaking its
cryptographic assumptions.
The /dev/urandom device in Linux systems gives you this, based on a
seed derived from the same sources as /dev/random.

5
We won’t.
6
Expected cost may be higher if the observer forgets their gas mask.

3. Statistical pseudorandomness. As above, but choose a function


that is not cryptographically secure but merely passes common statis-
tical tests for things like k-wise independence of consecutive outputs.
Which function to choose is largely a matter of convenience and fashion
(although PRNGs in older standard libraries can be very, very bad).
The advantage is that many PRNGs are very fast and will not slow
your program down. The disadvantage is that you are relying on your
program not doing anything that exposes the weakness of the PRNG.
The random function in the standard C library is an example of this,
although maybe not a good example. The cool kids mostly use Mersenne
Twister [MN98].

One advantage of pseudorandom generators is that they allow for debug-


ging: if you run your program twice with the same key, it will do the same
thing. This was in fact an argument made by von Neumann against using
physical randomness back in the old days [VN63].
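As a tiny illustration of the debugging point (my own example, not from the
notes): seeding Python’s Mersenne Twister-based generator makes a randomized
run exactly reproducible.

import random

def noisy_sum(seed):
    rng = random.Random(seed)            # Mersenne Twister under the hood
    return sum(rng.randrange(100) for _ in range(10))

print(noisy_sum(1234) == noisy_sum(1234))   # True: same seed, same "random" choices
print(noisy_sum(1234) == noisy_sum(5678))   # almost certainly False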
In practice, most people just use whatever PRNG is ready to hand, and
hope it works. As theorists, we will ignore this issue, and assume that our
random inputs are actually random.

1.5 Classifying randomized algorithms


Different random algorithms make different guarantees about the likelihood
of getting a correct output or even the possibility of recognizing a correct
output. These different guarantees have established names in the randomized
algorithms literature and correspond to various complexity classes in
complexity theory.

1.5.1 Las Vegas vs Monte Carlo


One difference between QuickSort and polynomial equality testing is that
QuickSort always succeeds, but may run for longer than you expect; while
the polynomial equality tester always runs in a fixed amount of time, but may

produce the wrong answer. These are examples of two classes of randomized
algorithms, which were originally named by László Babai [Bab79]:7
• A Las Vegas algorithm fails with some probability, but we can tell
when it fails. In particular, we can run it again until it succeeds, which
means that we can eventually succeed with probability 1 (but with a
potentially unbounded running time). Alternatively, we can think of a
Las Vegas algorithm as an algorithm that runs for an unpredictable
amount of time but always succeeds (we can convert such an algorithm
back into one that runs in bounded time by declaring that it fails if it
runs too long—a condition we can detect). QuickSort is an example of
a Las Vegas algorithm.
• A Monte Carlo algorithm fails with some probability, but we can’t
tell when it fails. If the algorithm produces a yes/no answer and the
failure probability is significantly less than 1/2, we can reduce the
probability of failure by running it many times and taking a majority of
the answers. The polynomial equality-testing algorithm is an example
of a Monte Carlo algorithm.
The heuristic for remembering which class is which is that the names
were chosen to appeal to English speakers: in Las Vegas, the dealer can tell
you whether you’ve won or lost, but in Monte Carlo, le croupier ne parle que
Français (the croupier speaks only French), so you have no idea what he’s saying.
Generally, we prefer Las Vegas algorithms, because we like knowing
when we have succeeded. But sometimes we have to settle for Monte Carlo
algorithms, which can still be useful if we can get the probability of failure
small enough. For example, any time we try to estimate an average by
sampling (say, inputs to a function we are trying to integrate or political
views of voters we are trying to win over) we are running a Monte Carlo
algorithm: there is always some possibility that our sample is badly non-
representative, but we can’t tell if we got a bad sample unless we already
know the answer we are looking for.
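As a concrete illustration of the majority-vote error reduction mentioned above for Monte Carlo algorithms, here is a minimal Python sketch. The tester noisy_test is a hypothetical stand-in for any yes/no Monte Carlo algorithm whose failure probability is bounded away from 1/2; repeating it and taking the majority answer drives the failure probability down.

    import random
    from collections import Counter

    def noisy_test(x, rng):
        # Hypothetical Monte Carlo tester: gives the correct yes/no answer for x
        # with probability 3/4 and the wrong answer otherwise.
        correct = (x % 2 == 0)                      # stand-in for the real predicate
        return correct if rng.random() < 0.75 else not correct

    def amplified_test(x, rng, trials=101):
        # Run the tester many times and report the majority answer; the chance
        # that the majority is wrong drops rapidly as trials grows.
        votes = Counter(noisy_test(x, rng) for _ in range(trials))
        return votes[True] > votes[False]

    rng = random.Random(1)
    print(amplified_test(4, rng))    # almost certainly True
    print(amplified_test(7, rng))    # almost certainly False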

1.5.2 Randomized complexity classes


Las Vegas vs Monte Carlo is the typical distinction made by algorithm
designers, but complexity theorists have developed more elaborate classifi-
cations. These include algorithms with “one-sided” failure properties. For
7
To be more precise, Babai defined Monte Carlo algorithms based on the properties of
Monte Carlo simulation, a technique dating back to the Manhattan project. The term
Las Vegas algorithm was new.

these algorithms, we never get a bogus “yes” answer but may get a bogus
“no” answer (or vice versa). This gives us several complexity classes that act
like randomized versions of NP, co-NP, etc.:

• The class R or RP (randomized P) consists of all languages L for which


a polynomial-time Turing machine M exists such that if x ∈ L, then
Pr [M (x, r) = 1] ≥ 1/2 and if x ∉ L, then Pr [M (x, r) = 1] = 0. In
other words, we can find a witness that x ∈ L with constant probability.
This is the randomized analog of NP (but it’s much more practical,
since with NP the probability of finding a winning witness may be
exponentially small).

• The class co-R consists of all languages L for which a poly-time Turing
machine M exists such that if x ∉ L, then Pr [M (x, r) = 1] ≥ 1/2 and
if x ∈ L, then Pr [M (x, r) = 1] = 0. This is the randomized analog of
co-NP.

• The class ZPP (zero-error probabilistic P ) is defined as RP ∩ co-RP.


If we run both our RP and co-RP machines for polynomial time, we
learn the correct classification of x with probability at least 1/2. The
rest of the time we learn only that we’ve failed (because both machines
return 0, telling us nothing). This is the class of (polynomial-time)
Las Vegas algorithms. The reason it is called “zero-error” is that we
can equivalently define it as the problems solvable by machines that
always output the correct answer eventually, but only run in expected
polynomial time.

• The class BPP (bounded-error probabilistic P) consists of all languages


L for which a poly-time Turing machine exists such that if x ∉ L,
then Pr [M (x, r) = 1] ≤ 1/3, and if x ∈ L, then Pr [M (x, r) = 1] ≥
2/3. These are the (polynomial-time) Monte Carlo algorithms: if our
machine answers 0 or 1, we can guess whether x ∈ L or not, but we
can’t be sure.

• The class PP (probabilistic P) consists of all languages L for which a


poly-time Turing machine exists such that if x ∉ L, then Pr [M (x, r) = 1] ≥
1/2, and if x ∈ L, then Pr [M (x, r) = 1] < 1/2. Since there is only an
exponentially small gap between the two probabilities, such algorithms
are not really useful in practice; PP is mostly of interest to complexity
theorists.

Assuming we have a source of random bits, any algorithm in RP, co-RP,


ZPP, or BPP is good enough for practical use. We can usually even get
away with using a pseudorandom number generator, and there are plausible
reasons to suspect that in fact every one of these classes is equal to P.
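The ZPP construction described above is easy to phrase as a loop. In the sketch below, rp_test and corp_test are hypothetical placeholders for the RP and co-RP machines: a 1 from the first certifies x ∈ L, a 1 from the second certifies x ∉ L, and in each round at least one of them speaks up with probability at least 1/2, so the expected number of rounds is at most 2.

    def zpp_decide(x, rp_test, corp_test, rng):
        # rp_test(x, rng) == 1 only if x is in L (it may say 0 either way);
        # corp_test(x, rng) == 1 only if x is not in L.  Both are hypothetical
        # placeholders.  Repeating until one of them produces a witness gives a
        # Las Vegas procedure with expected O(1) rounds.
        while True:
            if rp_test(x, rng):
                return True      # witness found: x is definitely in L
            if corp_test(x, rng):
                return False     # witness found: x is definitely not in L
            # both machines said 0: we learned nothing this round, so try again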

1.6 Classifying randomized algorithms by their methods
We can also classify randomized algorithms by how they use their randomness
to solve a problem. Some very broad categories:8

• Avoiding worst-case inputs, by hiding the details of the algorithm


from the adversary. Typically we assume that an adversary supplies
our input. If the adversary can see what our algorithm is going to
do (for example, he knows which door we will open first), he can use
this information against us. By using randomness, we can replace our
predictable deterministic algorithm by what is effectively a random
choice of many different deterministic algorithms. Since the adversary
doesn’t know which algorithm we are using, he can’t (we hope) pick
an input that is bad for all of them.

• Sampling. Here we use randomness to find an example or examples of objects that are likely to be typical of the population they are drawn from, either to estimate some average value (pretty much the basis of all of statistics) or because a typical element is useful in our algorithm (for example, when picking the pivot in QuickSort). Randomization means that the adversary can’t direct us to non-representative samples. A small code sketch of this kind of sampling-based estimation appears at the end of this section.

• Hashing. Hashing is the process of assigning a large object x a small


name h(x) by feeding it to a hash function h. Because the names are
small, the Pigeonhole Principle implies that many large objects hash to
the same name (a collision). If we have few objects that we actually
care about, we can avoid collisions by choosing a hash function that
happens to map them to different places. Randomization helps here
by keeping the adversary from choosing the objects after seeing what
our hash function is.
Hashing techniques are used both in load balancing (e.g., insuring that most cells in a hash table hold only a few objects) and in fingerprinting (e.g., using a cryptographic hash function to record a fingerprint of a file, so that we can detect when it has been modified).
8
These are largely adapted from the introduction to [MR95].

• Building random structures. The probabilistic method shows


the existence of structures with some desired property (often graphs
with interesting properties, but there are other places where it can be
used) by showing that a randomly-generated structure in some class
has a nonzero probability of having the property we want. If we can
beef the probability up to something substantial, we get a randomized
algorithm for generating these structures.

• Symmetry breaking. In distributed algorithms involving multi-


ple processes, progress may be stymied by all the processes trying to
do the same thing at the same time (this is an obstacle, for example,
in leader election, where we want only one process to declare itself
the leader). Randomization can break these deadlocks.
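Here is the promised sketch of sampling-based estimation (plain Python; the population and the function being averaged are arbitrary toy choices):

    import random

    def estimate_mean(f, population, samples, rng):
        # Average f over `samples` points drawn uniformly (with replacement)
        # from the population; the estimate is unbiased, and concentration
        # bounds (Chapter 5) control how far off it is likely to be.
        return sum(f(rng.choice(population)) for _ in range(samples)) / samples

    rng = random.Random(0)
    population = range(1, 1_000_001)
    # The true average of x**2 over 1..10**6 is roughly 3.33e11; the estimate
    # from 10,000 samples should land close to it.
    print(estimate_mean(lambda x: x * x, population, 10_000, rng))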
Chapter 2

Probability theory

In this chapter, we summarize the parts of probability theory that we


will use in the course. This is not really a substitute for reading an actual
probability theory book like Feller [Fel68] or Grimmett and Stirzaker [GS01],
but the hope is that it’s enough to get by.
The basic idea of probability theory is that we want to model all possible
outcomes of whatever process we are studying simultaneously. This gives
the notion of a probability space, which is the set of all possible outcomes;
for example, if we roll two dice, the probability space would consist of all 36
possible combinations of values. Subsets of this space are called events; an
example in the two-dice space would be the event that the sum of the two
dice is 11, given by the set A = {⟨5, 6⟩, ⟨6, 5⟩}. The probability of an event
A is given by a probability measure Pr [A]; for simple probability spaces,
this is just the sum of the probabilities of the individual outcomes contained
in A, while for more general spaces, we define the measure on events first
and the probabilities of individual outcomes are derived from this. Formal
definitions of all of these concepts are given later in this chapter.
When analyzing a randomized algorithm, the probability space describes
all choices of the random bits used by the algorithm, and we can think of the
possible executions of an algorithm as living within this probability space.
More formally, the sequence of operations carried out by an algorithm and the
output it ultimately produces are examples of random variables—functions
from a probability space to some other set—which we will discuss in detail
in Chapter 3.


2.1 Probability spaces and events


A discrete probability space is a countable set Ω of points or outcomes
ω. Each ω in Ω has a probability Pr [ω], which is a real value with 0 ≤
Pr [ω] ≤ 1. It is required that ∑_{ω∈Ω} Pr [ω] = 1.
An event A is a subset of Ω; its probability is Pr [A] = ∑_{ω∈A} Pr [ω].
We require that Pr [Ω] = 1, and it is immediate from the definition that
Pr [∅] = 0.
The complement Ā or ¬A of an event A is the event Ω − A. It is always
the case that Pr [¬A] = 1 − Pr [A].
This fact is a special case of the general principle that if A1 , A2 , . . . forms a partition of Ω—that is, if Ai ∩ Aj = ∅ when i ≠ j and ⋃ Ai = Ω—then ∑ Pr [Ai] = Pr [Ω] = 1. It happens that ¬A and A form a partition of Ω consisting of exactly two elements.
Even more generally, if A1 , A2 , . . . are disjoint events (that is, if Ai ∩ Aj = ∅ whenever i ≠ j), then Pr [⋃ Ai] = ∑ Pr [Ai]. This fact does not hold in general for events that are not disjoint.


For discrete probability spaces, all of these facts can be proven directly
from the definition of probabilities for events. For more general probability
spaces, it’s no longer possible to express the probability of an event as the
sum of the probabilities of its elements, and we adopt an axiomatic approach
instead.

2.1.1 General probability spaces


More general probability spaces consist of a triple (Ω, F, Pr) where Ω is
a set of points, F is a σ-algebra (a family of subsets of Ω that contains
Ω and is closed under complement and countable unions) of measurable
sets, and Pr is a function from F to [0, 1] that gives Pr [Ω] = 1 and satisfies
countable additivity: when A1 , . . . are disjoint, Pr [⋃ Ai] = ∑ Pr [Ai].
This definition is needed for uncountable spaces, because (under certain set-
theoretic assumptions) we may not be able to assign a meaningful probability
to all subsets of Ω.
Formally, this definition is often presented as three axioms of proba-
bility, due to Kolmogorov [Kol33]:

1. Pr [A] ≥ 0 for all A ∈ F.

2. Pr [Ω] = 1.

3. For any countable collection of disjoint events A1 , A2 , . . . ,

    Pr [⋃_i Ai] = ∑_i Pr [Ai].

It’s not hard to see that the discrete probability spaces defined in the
preceding section satisfy these axioms.
General probability spaces arise in randomized algorithms when we have
an algorithm that might consume an unbounded number of random bits.
The problem now is that an outcome consists of countable sequence of bits,
and there are uncountably many such outcomes. The solution is to consider
as measurable events only those sets with the property that membership
in them can be determined after a finite amount of time. Formally, the
probability space Ω is the set {0, 1}^ℕ of all countably infinite sequences of 0
and 1 values indexed by the natural numbers, and the measurable sets F
are all sets that can be generated by countable unions1 of cylinder sets,
where a cylinder set consists of all extensions xy of some finite prefix x. The
probability measure itself is obtained by assigning the set of all points that
start with x the probability 2^{−|x|}, and computing the probabilities of other
sets from the axioms.2
An oddity that arises in general probability spaces is that it may be that every
particular outcome has probability zero but their union has probability 1.
For example, the probability of any particular infinite string of bits is 0, but
the set containing all such strings is the entire space and has probability
1. This is where the fact that probabilities only add over countable unions
comes in.
Most randomized algorithms books gloss over general probability spaces,
with three good reasons. The first is that if we truncate an algorithm after
a finite number of steps, we usually get back to a discrete probability
space, which avoids a lot of worrying about measurability and convergence.
The second is that we are often implicitly working in a probability space that
is either discrete or well-understood (like the space of bit-vectors described
above). The last is that the Kolmogorov extension theorem says that
1
As well as complements and countable intersections. However, it is not hard to show
that sets defined using these operations can be reduced to countable unions of cylinder
sets.
2
This turns out to give the same probabilities as if we consider each outcome as a
real number in the interval [0, 1] and use Lebesgue measure to compute the probability of
events. For some applications, thinking of our random values as real numbers (or even
sequences of real numbers) can make things easier: consider for example what happens
when we want to choose one of three outcomes with equal probability.

if we specify Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] consistently for all finite sets of events


{A1 . . . Ak }, then there exists some probability space that makes these prob-
abilities work, even if we have uncountably many such events. So it’s usually
enough to specify how the events we care about interact, without worrying
about the details of the underlying space.

2.2 Boolean combinations of events


Even though events are defined as sets, we often think of them as representing
propositions that we can combine using the usual Boolean operations of NOT
(¬), AND (∧), and OR (∨). In terms of sets, these correspond to taking a
complement Ā = Ω \ A, an intersection A ∩ B, or a union A ∪ B.
We can use the axioms of probability to calculate the probability of Ā:

Lemma 2.2.1. Pr [Ā] = 1 − Pr [A].

Proof. First note that A ∩ Ā = ∅, so A ∪ Ā = Ω is a disjoint union of countably many3 events. This gives Pr [A] + Pr [Ā] = Pr [Ω] = 1.

For example, if our probability space consists of the six outcomes of a fair die roll, and A = [outcome is 3] with Pr [A] = 1/6, then Pr [outcome is not 3] = Pr [Ā] = 1 − 1/6 = 5/6. Though this example is trivial, using the formula does save us from having to add up the five cases where we don’t get 3.
If we want to know the probability of A ∩ B, we need to know more
about the relationship between A and B. For example, it could be that
A and B are both events representing a fair coin coming up heads, with
Pr [A] = Pr [B] = 1/2. The probability of A ∩ B could be anywhere between
1/2 and 0:

• For ordinary fair coins, we’d expect that half the time that A happens,
B also happens. This gives Pr [A ∩ B] = (1/2) · (1/2) = 1/4. To
make this formal, we might define our probability space Ω as having
four outcomes HH, HT, TH, and TT, each of which occurs with equal
probability.

• But maybe A and B represent the same fair coin: then A ∩ B = A and
Pr [A ∩ B] = Pr [A] = 1/2.
3
Countable need not be infinite, so 2 is countable.

• At the other extreme, maybe A and B represent two fair coins welded
together so that if one comes up heads the other comes up tails. Now
Pr [A ∩ B] = 0.

• With a little bit of tinkering, we could also find probabilities for


the outcomes in our four-outcome space to make Pr [A] = Pr [HH] +
Pr [HT] = 1/2 and Pr [B] = Pr [HH] + Pr [TH] = 1/2 while setting
Pr [A ∩ B] = Pr [HH] to any value between 0 and 1/2.

The difference between the nice case where Pr [A ∩ B] equals 1/4 and
the other, more annoying cases where it doesn’t is that in the first case we
have assumed that A and B are independent, which is defined to mean
that Pr [A ∩ B] = Pr [A] Pr [B].
In the real world, we expect events to be independent if they refer to
parts of the universe that are not causally related: if we flip two coins that
aren’t glued together somehow, then we assume that the outcomes of the
coins are independent. But we can also get independence from events that
are not causally disconnected in this way. An example would be if we rolled
a fair four-sided die labeled HH, HT, TH, TT, where we take the first letter
as representing A and the second as B.
There’s no simple formula for Pr [A ∪ B] when A and B are not disjoint,
even for independent events, but we can compute the probability by splitting
up into smaller, disjoint events and using countable additivity:
    Pr [A ∪ B] = Pr [(A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B)]
               = Pr [A ∩ B] + Pr [A ∩ B̄] + Pr [Ā ∩ B]
               = (Pr [A ∩ B] + Pr [A ∩ B̄]) + (Pr [Ā ∩ B] + Pr [A ∩ B]) − Pr [A ∩ B]
               = Pr [A] + Pr [B] − Pr [A ∩ B].

The idea is that we can compute Pr [A ∪ B] by adding up the individual


probabilities and then subtracting off the part where we counted the event
twice.
This is a special case of the general inclusion-exclusion formula, which
says:

Lemma 2.2.2. For any finite sequence of events A1 . . . An ,

    Pr [⋃_{i=1}^{n} Ai] = ∑_i Pr [Ai] − ∑_{i<j} Pr [Ai ∩ Aj] + ∑_{i<j<k} Pr [Ai ∩ Aj ∩ Ak] − . . .
                        = ∑_{S⊆{1...n}, S≠∅} (−1)^{|S|+1} Pr [⋂_{i∈S} Ai].     (2.2.1)

Proof. Partition Ω into 2^n disjoint events B_T, where B_T = (⋂_{i∈T} Ai) ∩ (⋂_{i∉T} Āi) is the event that all Ai occur for i in T and no Ai occurs for i not in T. Then Ai is the union of all B_T with T ∋ i, and ⋃ Ai is the union of all B_T with T ≠ ∅.
That the right-hand side gives the probability of this event is a sneaky consequence of the binomial theorem, and in particular the fact that ∑_{i=1}^{n} C(n, i) (−1)^i = ∑_{i=0}^{n} C(n, i) (−1)^i − 1 = (1 − 1)^n − 1 is −1 if n > 0 and 0 if n = 0, writing C(n, i) for the binomial coefficient. Using this fact after rewriting the right-hand side using the B_T events gives

    ∑_{S⊆{1...n}, S≠∅} (−1)^{|S|+1} Pr [⋂_{i∈S} Ai] = ∑_{S⊆{1...n}, S≠∅} (−1)^{|S|+1} ∑_{T⊇S} Pr [B_T]
        = ∑_{T⊆{1...n}} Pr [B_T] (∑_{S⊆T, S≠∅} (−1)^{|S|+1})
        = ∑_{T⊆{1...n}} (− Pr [B_T]) (∑_{i=1}^{|T|} C(|T|, i) (−1)^i)
        = ∑_{T⊆{1...n}} (− Pr [B_T]) ((1 − 1)^{|T|} − 1)
        = ∑_{T⊆{1...n}} Pr [B_T] (1 − 0^{|T|})
        = ∑_{T⊆{1...n}, T≠∅} Pr [B_T]
        = Pr [⋃_{i=1}^{n} Ai].
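Since the signs in (2.2.1) are easy to get wrong, a brute-force check on a small discrete space can be reassuring. The following Python sketch uses a toy 12-point uniform space and three arbitrarily chosen events:

    from itertools import combinations
    from fractions import Fraction

    omega = range(12)                        # a small uniform probability space
    def pr(event):
        return Fraction(len(event), len(omega))

    # three arbitrary events, chosen only to exercise the formula
    events = [{w for w in omega if w % 2 == 0},
              {w for w in omega if w % 3 == 0},
              {w for w in omega if w < 5}]

    lhs = pr(set().union(*events))           # probability of the union, directly
    rhs = Fraction(0)
    for k in range(1, len(events) + 1):
        for S in combinations(events, k):    # all nonempty subsets of the events
            rhs += (-1) ** (k + 1) * pr(set.intersection(*S))
    print(lhs, rhs, lhs == rhs)              # 3/4 3/4 True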

2.3 Conditional probability


The probability of A conditioned on B or probability of A given B, written Pr [A | B], is defined by

    Pr [A | B] = Pr [A ∩ B] / Pr [B],     (2.3.1)

provided Pr [B] ≠ 0. If Pr [B] = 0, we can’t condition on B.


Such conditional probabilities represent the effect of restricting our prob-
ability space to just B, which we can think of as computing the probability of
each event if we know that B occurs. The intersection in the numerator
limits A to circumstances where B occurs, while the denominator normalizes
the probabilities so that, for example, Pr [Ω | B] = Pr [B | B] = 1.

2.3.1 Conditional probability and independence


Rearranging (2.3.1) gives Pr [A ∩ B] = Pr [B] Pr [A | B] = Pr [A] Pr [B | A].
In many cases, knowing that B occurs tells us nothing about whether A
occurs; if so, we have Pr [A | B] = Pr [A], which implies that Pr [A ∩ B] =
Pr [A | B] Pr [B] = Pr [A] Pr [B]—events A and B are independent. So
Pr [A | B] = Pr [A] gives an alternative criterion for independence when
Pr [B] is nonzero.4
A set of events A1 , A2 , . . . is independent if Ai is independent of B when
B is any Boolean formula of the Aj for j ≠ i. The idea is that you can’t
predict Ai by knowing anything about the rest of the events.
A set of events A1 , A2 , . . . is pairwise independent if each Ai and Aj
are independent when i ≠ j. It is possible for a set of events to be pairwise
independent but not independent; a simple example is when A1 and A2 are
the events that two independent coins come up heads and A3 is the event
that both coins come up with the same value. The general version of pairwise
independence is k-wise independence, which means that any subset of k
(or fewer) events are independent.

2.3.2 Conditional probability and the law of total probability


The reason we like conditional probability in algorithm analysis is that it
gives us a natural way to model the kind of case analysis that we are used
to applying to deterministic algorithms. Suppose we are trying to prove that
a randomized algorithm works (event A) with a certain probability. Most
4
If Pr [B] is zero, then A and B are always independent.

likely, the first random thing the algorithm does is flip a coin, giving two possible outcomes B and B̄. Countable additivity tells us that Pr [A] = Pr [A ∩ B] + Pr [A ∩ B̄], which we can rewrite using conditional probability as

    Pr [A] = Pr [A | B] Pr [B] + Pr [A | B̄] Pr [B̄],     (2.3.2)

a special case of the law of total probability.


What’s nice about this expression is that we can often compute Pr [A | B] and Pr [A | B̄] by looking at what the algorithm does starting from the point where it has just gotten heads (B) or tails (B̄), and use the formula to combine these values to get the overall probability of success.
For example, if

Pr [class occurs | snow] = 3/5,


Pr [class occurs | no snow] = 99/100, and
Pr [snow] = 1/10,

then

Pr [class occurs] = (3/5) · (1/10) + (99/100) · (1 − 1/10) = 0.951.

More generally, we can do the same computation for any partition of Ω into countably many disjoint events Bi :

    Pr [A] = Pr [⋃_i (A ∩ Bi)]
           = ∑_i Pr [A ∩ Bi]
           = ∑_{i : Pr[Bi]≠0} Pr [A | Bi] Pr [Bi],     (2.3.3)

which is the law of total probability. Note that the last step works for
each term only if Pr [A | Bi] is well-defined, meaning that Pr [Bi] ≠ 0. But
any case where Pr [Bi] = 0 also has Pr [A ∩ Bi] = 0, so we get the correct
answer if we simply omit these terms from both sums.
A special case arises when Pr [A | B̄] = 0, which occurs, for example,
if A ⊆ B. Then we just have Pr [A] = Pr [A | B] Pr [B]. If we consider an

event A = A1 ∩ A2 ∩ · · · ∩ Ak , then we can iterate this expansion to get

Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] = Pr [A1 ∩ · · · ∩ Ak−1 ] Pr [Ak | A1 , . . . , Ak−1 ]


= Pr [A1 ∩ · · · ∩ Ak−2 ] Pr [Ak−1 | A1 , . . . , Ak−2 ] Pr [Ak | A1 , . . . , Ak−1 ]
= ...
    = ∏_{i=1}^{k} Pr [Ai | A1 , . . . , A_{i−1}].     (2.3.4)

Here Pr [A | B, C, . . .] is short-hand for Pr [A | B ∩ C ∩ . . .], the probability


that A occurs given that all of B, C, etc., occur.

2.3.3 Examples
Here we have some examples of applying conditional probability to algorithm
analysis. Mostly we will be using some form of the law of total probability.

2.3.3.1 Racing coin-flips


Suppose that I flip coins and allocate a space for each heads that I get before
the coin comes up tails. Suppose that you then supply me with objects (each
of which takes up one unit of space), one for each heads that you get before
you get tails. What are my chances of allocating enough space?
Let’s start by solving this directly using the law of total probability. Let
Ai be the event that I allocate i spaces. The event Ai is the intersection of
i independent events that I get heads in the first i positions and the event
that I get tails in position i + 1; this multiplies out to (1/2)^{i+1}. Let Bi be
the similar event that you supply i objects. Let W be the event that I win.
To make the Ai partition the space, we must also add an extra event A∞
equal to the singleton set {HHHHHHH . . .} consisting of the all-H sequence;
this has probability 0 (so it won’t have much of an effect), but we need to
include it since HHHHHHH . . . is not contained in any of the other Ai .

We can compute

    Pr [W | Ai] = Pr [B0 ∪ B1 ∪ · · · ∪ Bi | Ai]
                = Pr [B0 ∪ B1 ∪ · · · ∪ Bi]
                = Pr [B0] + Pr [B1] + · · · + Pr [Bi]
                = ∑_{j=0}^{i} (1/2)^{j+1}
                = (1/2) · (1 − (1/2)^{i+1})/(1 − 1/2)
                = 1 − (1/2)^{i+1}.     (2.3.5)

The clean form of this expression suggests strongly that there is a better
way to get it, and that this way involves taking the negation of the intersection
of i + 1 independent events that occur with probability 1/2 each. With a
little reflection, we can see that the probability that your objects don’t fit in
my buffer is exactly (1/2)^{i+1}.
From the law of total probability (2.3.3),

    Pr [W ] = ∑_{i=0}^{∞} (1 − (1/2)^{i+1})(1/2)^{i+1}
            = 1 − ∑_{i=0}^{∞} (1/4)^{i+1}
            = 1 − (1/4) · 1/(1 − 1/4)
            = 2/3.

This gives us our answer. However, we again see an answer that is


suspiciously simple, which suggests looking for another way to find it. We
can do this using conditional probability by defining new events Ci , where
Ci contains all sequences of coin-flips for both players where both get i heads in a
row but at least one gets tails on the (i + 1)-th coin. These events plus the
probability-zero event C∞ = {HHHHHHH . . . , TTTTTTT . . .} partition the
space, so Pr [W ] = ∑_{i=0}^{∞} Pr [W | Ci] Pr [Ci].

Now we ask, what is Pr [W | Ci ]? Here we only need to consider three


cases, depending on the outcomes of our (i + 1)-th coin-flips. The cases
⟨H, T⟩ and ⟨T, T⟩ cause me to win, while the case ⟨T, H⟩ causes me to
lose, and each case occurs with equal probability conditioned on Ci (which
excludes ⟨H, H⟩). So I win 2/3 of the time conditioned on Ci , and summing

Pr [W ] = ∑_{i=0}^{∞} (2/3) Pr [Ci] = 2/3 since Pr [Ci] sums to 1, because the union
of these events includes the entire space except for the probability-zero event
C∞ .
Still another approach is to compute the probability that our runs have
exactly the same length (∑_{i=1}^{∞} 2^{−i} · 2^{−i} = 1/3), and argue by symmetry that
the remaining 2/3 probability is equally split between my run being longer
(1/3) and your run being longer (1/3). Since W occurs if my run is just as
long or longer, Pr[W ] = 1/3 + 1/3 = 2/3. A nice property of this approach is
that the only summation involved is over disjoint events, so we get to avoid
using conditional probability entirely.
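A quick simulation agrees with all three derivations (a sketch; the run lengths are generated directly by flipping simulated fair coins until the first tails):

    import random

    def run_length(rng):
        # number of heads before the first tails for a fair coin
        n = 0
        while rng.random() < 0.5:
            n += 1
        return n

    rng = random.Random(42)
    trials = 100_000
    wins = 0
    for _ in range(trials):
        mine, yours = run_length(rng), run_length(rng)
        if mine >= yours:            # my buffer is big enough for your objects
            wins += 1
    print(wins / trials)             # should be close to 2/3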

2.3.3.2 Karger’s min-cut algorithm


Here we’ll give a simple algorithm for finding a global min-cut in a multi-
graph,5 due to David Karger [Kar93].
The idea is that we are given a multigraph G, and we want to partition
the vertices into nonempty sets S and T such that the number of edges with
one endpoint in S and one endpoint in T is as small as possible. There are
many efficient ways to do this, most of which are quite sophisticated. There
is also the algorithm we will now present, which solves the problem with
reasonable efficiency using almost no sophistication at all (at least in the
algorithm itself).
The main idea is that given an edge uv, we can construct a new multigraph
G1 by contracting the edge: in G1 , u and v are replaced by a single vertex,
and any edge that used to have either vertex as an endpoint now goes to the
combined vertex (edges with both endpoints in {u, v} are deleted). Karger’s
algorithm is to contract edges chosen uniformly at random until only two
vertices remain. All the vertices that got packed into one of these become
S, the others become T . It turns out that this finds a minimum cut with
probability at least 1/C(n, 2), where C(n, 2) = n(n − 1)/2.
An example of the algorithm in action is given in Figure 2.1.

Theorem 2.3.1. Given any min cut (S, T ) of a graph G on n vertices, Karger’s algorithm outputs (S, T ) with probability at least 1/C(n, 2).

Proof. Let (S, T ) be a min cut of size k. Then the degree of each vertex v
is at least k (otherwise (v, G − v) would be a smaller cut), and G contains
at least kn/2 edges. The probability that we contract an S–T edge is thus
at most k/(kn/2) = 2/n, and the probability that we don’t contract one is
5
Unlike ordinary graphs, multigraphs can have more than one edge between two vertices.

[Figure omitted in this text-only version: a six-vertex graph on a, b, c, d, e, f shown at successive stages of contraction.]

Figure 2.1: Karger’s min-cut algorithm. Initial graph (at top) has min cut ⟨{a, b, c}, {d, e, f}⟩. We find this cut by getting lucky and contracting edges ab, df, de, and ac in that order. The final graph (at bottom) gives the cut.

1 − 2/n = (n − 2)/n. Assuming we missed collapsing (S, T ) the first time,


we now have a new graph G1 with n − 1 vertices in which the min cut is still
of size k. So now the chance that we miss (S, T ) is (n − 3)/(n − 1). We stop
when we have two vertices left, so the last step succeeds with probability
1/3.
We can compute the probability that the S–T cut is never contracted by
applying (2.3.4), which just tells us to multiply all the conditional probabili-
ties together:
    ∏_{i=3}^{n} (i − 2)/i = 2/(n(n − 1)).

This tells us what happens when we are considering a particular min cut.
If the graph has more than one min cut, this only makes our life easier. Note
that since each min cut turns up with probability at least 1/C(n, 2), there can’t
be more than C(n, 2) of them.6 But even if there is only one, we have a good



6
The suspiciously combinatorial appearance of the 1/C(n, 2) suggests that there should
be some way of associating minimum cuts with particular pairs of vertices, but I’m not
aware of any natural way to do this. Sometimes the appearance of a simple expression
in a surprising context may just stem from the fact that there aren’t very many distinct
simple expressions.

chance of finding it if we simply re-run the algorithm substantially more


than C(n, 2) times.
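For readers who want to see the contraction process in code, here is a compact Python sketch of Karger’s algorithm. It represents the multigraph as a list of edges, contracts by relabeling endpoints, and is meant to illustrate the process rather than to be efficient; the example graph at the end is made up for illustration.

    import random

    def karger_min_cut(edges, rng):
        # One run of the contraction algorithm on a connected multigraph given
        # as a list of (u, v) pairs (parallel edges allowed).  Returns the size
        # of the cut found: the number of edges left when two vertices remain.
        edges = list(edges)
        vertices = {v for e in edges for v in e}
        while len(vertices) > 2:
            u, v = rng.choice(edges)                 # contract a random edge
            vertices.discard(v)
            # relabel v as u everywhere and drop the resulting self-loops
            edges = [(u if a == v else a, u if b == v else b) for (a, b) in edges]
            edges = [(a, b) for (a, b) in edges if a != b]
        return len(edges)

    def repeated_karger(edges, runs, rng):
        # Each run finds a minimum cut with probability at least 1/C(n, 2),
        # so repeating many times finds one with high probability.
        return min(karger_min_cut(edges, rng) for _ in range(runs))

    # made-up example: two triangles joined by one bridge edge; the min cut is 1
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
    print(repeated_karger(edges, runs=100, rng=random.Random(3)))   # almost certainly 1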
Chapter 3

Random variables

Probabilities are fine when all we want to do is ask whether an algorithm


worked or not. But for many randomized algorithms (particularly Las Vegas
algorithms), we can structure the algorithm so that it works eventually
with probability 1. This makes the important question that of how long is
eventually. To measure a quantity like running time that depends on the
random choices of the algorithm, we need a random variable, which is just
a function whose domain is some probability space Ω.1
Even though random variables are just functions, rather than writing a
random variable as f (ω) everywhere, the convention is to write a random
variable as a capital letter (X, Y , S, etc.) and make the argument implicit:
X is really X(ω). Variables that aren’t random (or aren’t variable) are
written in lowercase.
Most of the random variables we will consider will be discrete random
variables. A discrete random variable takes on only countably many values,
each with some nonzero probability.
For example, consider the probability space corresponding to rolling two
independent fair six-sided dice. There are 36 possible outcomes in this space,
corresponding to the 6 × 6 pairs of values ⟨x, y⟩ we might see on the two dice.
We could represent the value of each die as a random variable X or Y given
by X(⟨x, y⟩) = x or Y (⟨x, y⟩) = y, but for many applications, we don’t care
so much about the specific values on each die. Instead, we want to know
the sum S = X + Y of the dice. This value S is also a random variable; as a
function on Ω, it’s defined by S(⟨x, y⟩) = x + y.
1
Technically, this only works for discrete spaces. In general, a random variable is a measurable function from a probability space (Ω, F) to some other set S equipped with its own σ-algebra F′. What makes a function measurable in this sense is that for any set A in F′, the inverse image f^{−1}(A) must be in F. See §3.3 for more details.


Random variables need not be real-valued. There’s no reason why we


can’t think of the pair ⟨x, y⟩ itself as a random variable, whose range is the
set [1 . . . 6] × [1 . . . 6]. Similarly, if we imagine choosing a point uniformly at
random in the unit square [0, 1]^2, its coordinates are a random variable. For
a more exotic example, the random graph G_{n,p} obtained by starting with
n vertices and including each possible edge with independent probability p
is a random variable whose range is the set of all graphs on n vertices.

3.1 Operations on random variables


Random variables may be combined using standard arithmetic operators,
have functions applied to them, etc., to get new random variables. For
example, the random variable X/Y is a function from Ω that takes on the
value X(ω)/Y (ω) on each point ω.

3.2 Random variables and events


Any random variable X allows us to define events based on its possible values.
Typically these are expressed by writing a predicate involving the random
variable in square brackets. An example would be the probability that the
sum of two dice is exactly 11: [S = 11]; or that the sum of the dice is less than
5: [S < 5]. These are both sets of outcomes; we could expand [S = 11] =
{⟨5, 6⟩, ⟨6, 5⟩} or [S < 5] = {⟨1, 1⟩, ⟨1, 2⟩, ⟨1, 3⟩, ⟨2, 1⟩, ⟨2, 2⟩, ⟨3, 1⟩}. This
allows us to calculate the probability that a random variable has particular
properties: Pr [S = 11] = 2/36 = 1/18 and Pr [S < 5] = 6/36 = 1/6.
Conversely, given any event A, we can define an indicator random
variable 1A that is 1 when A occurs and 0 when it doesn’t.2 Formally,
1A (ω) = 1 for ω in A and 1A (ω) = 0 for ω not in A.
Indicator variables are mostly useful when combined with other random
variables. For example, if you roll two dice and normally collect the sum of
the values but get nothing if it is 7, we could write your payoff as S · 1_{[S≠7]}.
The probability mass function of a random variable gives Pr [X = x]
for each possible value x. For example, our random variable S has the
2
Some people like writing χA for these.
You may also see [P ] where P is some predicate, a convention known as Iverson
notation or the Iverson bracket that was invented by Iverson for the programming
language APL, appears in later languages like C where the convention is that true predicates
evaluate to 1 and false ones to 0, and ultimately popularized for use in mathematics—with
the specific choice of square brackets to set off the predicate—by Graham et al. [GKP88].
Out of these alternatives, I personally find 1A to be the least confusing.

S Probability
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36

Table 3.1: Probability mass function for the sum of two independent fair
six-sided dice

probability mass function shown in Table 3.1. For a discrete random variable
X, the probability mass function gives enough information to calculate the
probability of any event involving X, since we can just sum up cases using
countable additivity. This gives us another way to compute Pr [S < 5] =
Pr [S = 2] + Pr [S = 3] + Pr [S = 4] = (1 + 2 + 3)/36 = 1/6.
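The probability mass function in Table 3.1 and calculations like Pr [S < 5] are easy to reproduce by brute-force enumeration of the 36 outcomes (a quick sketch):

    from collections import Counter
    from fractions import Fraction

    # enumerate the 36 equally likely outcomes of two fair dice
    pmf = Counter()
    for x in range(1, 7):
        for y in range(1, 7):
            pmf[x + y] += Fraction(1, 36)

    print(pmf[11])                                    # 1/18
    print(sum(p for s, p in pmf.items() if s < 5))    # 1/6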
For two random variables, the joint probability mass function gives
Pr [X = x ∧ Y = y] for each pair of values x and y. This generalizes in the
obvious way for more than two variables.
We will often refer to the probability mass function as giving the distri-
bution or joint distribution of a random variable or collection of random
variables, even though distribution (for real-valued variables) technically
refers to the cumulative distribution function F (x) = Pr [X ≤ x], which
is generally not directly computable from the probability mass function for
continuous random variables that take on uncountably many values. To
the extent that we can, we will try to avoid continuous random variables,
and the rather messy integration theory needed to handle them.
Two or more random variables are independent if all sets of events
involving different random variables are independent. In terms of proba-

bility mass functions, X and Y are independent if Pr [X = x ∧ Y = y] =


Pr [X = x] · Pr [Y = y] for any constants x and y. In terms of cumulative
distribution functions, X and Y are independent if Pr [X ≤ x ∧ Y ≤ y] =
Pr [X ≤ x] · Pr [Y ≤ y] for any constants x and y. As with events, we gen-
erally assume that random variables associated with causally disconnected
processes are independent, but this is not the only way we might have
independence.
It’s not hard to see that the individual die values X and Y in our
two-dice example are independent, because every possible combination of
values x and y has the same probability 1/36 = Pr [X = x] Pr [Y = y]. If we
chose a different probability distribution on the space, we might not have
independence.

3.3 Measurability
For discrete probability spaces, any function on outcomes can be a random
variable. The reason is that any event in a discrete probability space has
a well-defined probability. For more general spaces, in order to be useful,
events involving a random variable should have well-defined probabilities.
For discrete random variables that take on only countably many values
(e.g., integers or rationals), it’s enough for the event [X = x] (that is, the
set {ω | X(ω) = x}) to be in F for all x. For real-valued random variables,
we ask that the event [X ≤ x] be in F. In these cases, we say that X is
measurable with respect to F, or just measurable F. More exotic random
variables use a definition of measurability that generalizes the real-valued
version, which we probably won’t need.3 Since we usually just assume that
all of our random variables are measurable unless we are doing something
funny with F to represent ignorance, this issue won’t come up much.

3.4 Expectation
The expectation or expected value of a random variable X is given by
E [X] = ∑_x x Pr [X = x]. This is essentially an average value of X weighted
by probability, and it only makes sense if X takes on values that can be
summed in this way (e.g., real or complex values, or vectors in a real- or
3
The general version is that if X takes on values in another measure space (Ω′, F′), then the inverse image X^{−1}(A) = {ω ∈ Ω | X(ω) ∈ A} of any set A in F′ is in F. This means in particular that Pr_Ω maps through X to give a probability measure on Ω′ by Pr_{Ω′}[A] = Pr_Ω[X^{−1}(A)], and the condition on X^{−1}(A) being in F makes this work.

complex-valued vector space). Even if the expectation makes sense, it may


be that a particular random variable X doesn’t have an expectation, because
the sum fails to converge.4
For an example that does work, if X and Y are independent fair six-sided dice, then E [X] = E [Y ] = ∑_{k=1}^{6} k · (1/6) = 21/6 = 7/2, while E [X + Y ] is the rather horrific

    ∑_{k=2}^{12} k · Pr [X + Y = k] = (2·1 + 3·2 + 4·3 + 5·4 + · · · + 9·4 + 10·3 + 11·2 + 12·1)/36
                                    = 252/36
                                    = 7.
The fact that 7 = 7/2 + 7/2 here is not a coincidence, but a consequence of
linearity of expectation, which is the subject of the next section.

3.4.1 Linearity of expectation


The main reason we like expressing the run times of algorithms in terms of
expectation is linearity of expectation: E [aX + bY ] = E [aX] + E [bY ]
for all random variables X and Y for which E [X] and E [Y ] are defined, and
all constants a and b. This means that we can compute the running time for
different parts of our algorithm separately and then add them together, even
if the costs of different parts of the algorithm are not independent.
The general version is E [∑ ai Xi] = ∑ ai E [Xi] for any finite collection of
random variables Xi and constants ai , which follows by applying induction to
the two-variable case. A special case is E [cX] = c E [X] when c is constant.
For discrete random variables, linearity of expectation follows immediately
from the definition of expectation and the fact that the event [X = x] is the
4
Example: Let X be the number of times you flip a fair coin until it comes up heads. We’ll see later that E [X] = 1/(1/2) = 2. But E [2^X] = ∑_{n=1}^{∞} 2^n 2^{−n} = ∑_{n=1}^{∞} 1, which diverges. With some tinkering it is possible to come up with even uglier cases, like an array that contains 1 element on average but requires infinite expected time to sort using a Θ(n log n) algorithm.

disjoint union of the events [X = x, Y = y] for all y:


    E [aX + bY ] = ∑_{x,y} (ax + by) Pr [X = x ∧ Y = y]
                 = a ∑_{x,y} x Pr [X = x, Y = y] + b ∑_{x,y} y Pr [X = x, Y = y]
                 = a ∑_x x ∑_y Pr [X = x, Y = y] + b ∑_y y ∑_x Pr [X = x, Y = y]
                 = a ∑_x x Pr [X = x] + b ∑_y y Pr [Y = y]
                 = a E [X] + b E [Y ].

A technical note: we are assuming that E [X], E [Y ], and E [X + Y ] all


exist.
This proof does not require that X and Y be independent. The sum
of two fair six-sided dice always has expectation 7/2 + 7/2 = 7, whether they
are independent dice, the same die counted twice, or one die X and its
complement 7 − X.
Linearity of expectation makes it easy to compute the expectations of
random variables that can be expressed as sums of other random variables.
One example that will come up a lot is a binomial random variable, which is a sum S = ∑_{i=1}^{n} Xi of n independent Bernoulli random variables, each of which is 1 with probability p and 0 with probability q = 1 − p. These are called binomial random variables because the probability that S is equal to k is given by

    Pr [S = k] = C(n, k) p^k q^{n−k},     (3.4.1)

which is the k-th term in the binomial expansion of (p + q)^n. In this case each Xi has E [Xi] = p, so E [S] is just np. It is possible to calculate this fact directly from (3.4.1), but it’s much more work.5
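To see the two routes to E [S] side by side, the following sketch computes the expectation of a binomial random variable once directly from (3.4.1) and once as the sum of the individual E [Xi] = p; the particular n and p are arbitrary choices for illustration.

    from fractions import Fraction
    from math import comb

    n, p = 10, Fraction(2, 5)        # arbitrary parameters
    q = 1 - p

    # direct computation of E[S] from the probability mass function (3.4.1)
    direct = sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

    # linearity of expectation: E[S] is the sum of n terms each equal to p
    by_linearity = n * p

    print(direct, by_linearity, direct == by_linearity)   # both equal 4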

3.4.1.1 Linearity of expectation for infinite sequences


For infinite sequences of random variables, linearity of expectation may
break down. This is true even if the sequence is countable. An example
5
One way is to use the probability generating function F(z) = ∑_{k=0}^{∞} Pr [S = k] z^k = ∑_{k=0}^{n} C(n, k) p^k q^{n−k} z^k = (pz + q)^n. Then take the derivative F′(z) = ∑_{k=0}^{∞} Pr [S = k] k z^{k−1} and observe F′(1) = ∑_{k=0}^{∞} Pr [S = k] k = E [S]. Or we can write F′(z) as n(pz + q)^{n−1} p, which gives F′(1) = n(p + q)^{n−1} p = np.

is the St. Petersburg paradox, in which a gambler bets $1 on a double-or-nothing game, then bets $2 if they lose, then $4, and so on, until they eventually win and stop, up $1. If we represent the gambler’s gain or loss at stage i as a random variable Xi , it’s easy to show that E [Xi] = 0, because the gambler either wins ±2^i with equal probability, or doesn’t play at all. So ∑_{i=0}^{∞} E [Xi] = 0. But E [∑_{i=0}^{∞} Xi] = 1, because the probability that the gambler doesn’t eventually win is zero.6
Fortunately, these pathological cases don’t come up often in algorithm
analysis, and with some additional side constraints we can apply linearity of
expectation even to infinite sums of random variables. The simplest is when
Xi ≥ 0 for all i; then E [∑_{i=0}^{∞} Xi] exists and is equal to ∑_{i=0}^{∞} E [Xi] whenever the sum of the expectations converges (this is a consequence of the monotone convergence theorem). Another condition that works is if |∑_{i=0}^{n} Xi| ≤ Y for all n, where Y is a random variable with finite expectation; the simplest version of this is when Y is constant. See [GS92, §5.6.12] or [Fel71, §IV.2]
for more details.

3.4.2 Expectation and inequalities


If X ≤ Y (that is, if the event [X ≤ Y ] holds with probability 1), then
E [X] ≤ E [Y ]. For finite discrete spaces the proof is trivial:
    E [X] = ∑_{ω∈Ω} Pr [ω] X(ω)
          ≤ ∑_{ω∈Ω} Pr [ω] Y (ω)
          = E [Y ].

The claim continues to hold even in the general case, but the proof is
more work.
One special case of this that comes up often is that X ≥ 0 implies
E [X] ≥ 0.
6
The trick here is that we are trading a probability-1 gain of 1 against a probability-0 loss of ∞. So we could declare that E [∑_{i=0}^{∞} Xi] involves 0 · (−∞) and is undefined.
But this would lose the useful property that expectation isn’t affected by probability-0
outcomes. As often happens in mathematics, we are forced to choose between candidate
definitions based on which bad consequences we most want to avoid, with no way to avoid
all of them. So the standard definition of expectation allows the St. Petersburg paradox
because the alternatives are worse.

3.4.3 Expectation of a product


When two random variables X and Y are independent, it also holds that
E [XY ] = E [X] E [Y ]. The proof (at least for discrete random variables) is
straightforward:
    E [XY ] = ∑_x ∑_y xy Pr [X = x ∧ Y = y]
            = ∑_x ∑_y xy Pr [X = x] Pr [Y = y]
            = (∑_x x Pr [X = x]) (∑_y y Pr [Y = y])
            = E [X] E [Y ].

For example, the expectation of the product of two independent fair six-sided dice is (7/2)^2 = 49/4.
This is not true for arbitrary random variables. If we compute the expectation of the product of a single fair six-sided die with itself, we get (1·1 + 2·2 + 3·3 + 4·4 + 5·5 + 6·6)/6 = 91/6, which is much larger.
One measure of the dependence between two random variables is the
difference E [XY ] − E [X] · E [Y ]. This is called the covariance of X and Y ,
written Cov [X, Y ], and it is 0 when X and Y are independent and nonzero
otherwise. Covariance will come back later when we look at concentration
bounds in Chapter 5.

3.4.3.1 Wald’s equation (simple version)


Computing the expectation of a product does not often come up directly in
the analysis of a randomized algorithm. Where we might expect to do it is
when we have a loop: one random variable N tells us the number of times
we execute the loop, while another random variable X tells us the cost of
each iteration. The problem is that if each iteration is randomized, then we
really have a sequence of random variables X1 , X2 , . . . , and what we want
to calculate is
    E [∑_{i=1}^{N} Xi].     (3.4.2)

Here we can’t use the sum formula directly, because N is a random variable,
and we can’t use the product formula, because the Xi are all different random
variables.

If N and the Xi are all independent (which may or may not be the case
for the loop example), and N is bounded by some fixed maximum n, then
we can apply the product rule to get the value of (3.4.2) by throwing in
a few indicator variables. The idea is that the contribution of Xi to the
sum is given by Xi 1_{[N≥i]}, and because we assume that N is independent of the Xi, if we need to compute E [Xi 1_{[N≥i]}], we can do so by computing E [Xi] E [1_{[N≥i]}].
So we get

    E [∑_{i=1}^{N} Xi] = E [∑_{i=1}^{n} Xi 1_{[N≥i]}]
                       = ∑_{i=1}^{n} E [Xi 1_{[N≥i]}]
                       = ∑_{i=1}^{n} E [Xi] E [1_{[N≥i]}].

For general Xi we have to stop here. But if we also know that the Xi all
have the same expectation µ, then E [Xi ] doesn’t depend on i and we can
bring it out of the sum. This gives
    ∑_{i=1}^{n} E [Xi] E [1_{[N≥i]}] = µ ∑_{i=1}^{n} E [1_{[N≥i]}]
                                     = µ E [N ].     (3.4.3)
This equation is a special case of Wald’s equation, which we will see
again in §9.4.2. The main difference between this version and the general
version is that here we had to assume that N was independent of the Xi ,
which may not be true if our loop is a while loop, and termination after a
particular iteration is correlated with the time taken by that iteration.
But for simple cases, (3.4.3) can still be useful. For example, if we throw
one six-sided die to get N , and then throw N six-sided dice and add them
up, we get the same expected total (7/2) · (7/2) = 49/4 as if we just multiply two
six-sided dice. This is true even though the actual distribution of the values
is very different in the two cases.
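A quick simulation sketch of the die-then-dice example shows the average total settling near µ E [N ] = (7/2) · (7/2) = 49/4, even though the distribution of the total is quite different from that of a product of two dice:

    import random

    rng = random.Random(7)

    def total(rng):
        n = rng.randint(1, 6)                        # N: how many dice to roll
        return sum(rng.randint(1, 6) for _ in range(n))

    trials = 200_000
    average = sum(total(rng) for _ in range(trials)) / trials
    print(average)                                   # close to 49/4 = 12.25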

3.5 Conditional expectation


We can also define a notion of conditional expectation, analogous to
conditional probability. There are three versions of this, depending on how

fancy we want to get about specifying what information we are conditioning


on.

3.5.1 Expectation conditioned on an event


The expectation of X conditioned on an event A is written E [X | A] and
defined by
    E [X | A] = ∑_x x Pr [X = x | A] = ∑_x x · Pr [X = x ∧ A] / Pr [A].     (3.5.1)

This is essentially the weighted average value of X if we know that A occurs.


Most of the properties that we see with ordinary expectations continue
to hold for conditional expectation. For example, linearity of expectation

E [aX + bY | A] = a E [X | A] + b E [Y | A] (3.5.2)

holds whenever a and b are constant on A.


Similarly if Pr [X ≥ Y | A] = 1, E [X | A] ≥ E [Y | A].
Conditional expectation is handy because we can use it to compute
expectations by case analysis the same way we use conditional probabilities
using the law of total probability (see §2.3.2). If A1 , A2 , . . . are a countable
partition of Ω, then
    ∑_i Pr [Ai] E [X | Ai] = ∑_i Pr [Ai] (∑_x x Pr [X = x | Ai])
                           = ∑_i Pr [Ai] (∑_x x · Pr [X = x ∧ Ai] / Pr [Ai])
                           = ∑_i ∑_x x Pr [X = x ∧ Ai]
                           = ∑_x x (∑_i Pr [X = x ∧ Ai])
                           = ∑_x x Pr [X = x]
                           = E [X].     (3.5.3)

This is actually a special case of the law of iterated expectation, which


we will see in the next section.

3.5.2 Expectation conditioned on a random variable


In the previous section, we considered computing E [X] by breaking it up
into disjoint cases E [X | A1 ], E [X | A2 ], etc. But keeping track of all the
events in our partition of Ω is a lot of book-keeping. Conditioning on a
random variable lets us combine all these conditional probabilities into a
single expression
E [X | Y ] ,
the expectation of X conditioned on Y , which is defined to have the
value E [X | Y = y] whenever Y = y.7
Note that E [X | Y ] is generally a function of Y , unlike E [X] which is a
constant. This also means that E [X | Y ] is a random variable, and its value
can depend on which outcome ω we picked from our probability space Ω.
The intuition behind the definition is that E [X | Y ] is the best estimate we
can make of X given that we know the value of Y but nothing else.
If we want to be formal about the definition, we can specify the value of
E [X | Y ] explicitly for each point ω ∈ Ω:

E [X | Y ] (ω) = E [X | Y = Y (ω)] . (3.5.4)

This is just another way of saying what we said already: if you want to know
what the expectation of X is conditioned on Y when you get outcome ω,
find the value of Y at ω and condition on seeing that.
Here is a simple example. Suppose that X and Y are independent fair
coin-flips that take on the values 0 and 1 with equal probability. Then our
probability space Ω has four elements, and looks like this:

    ⟨0, 0⟩ ⟨0, 1⟩
    ⟨1, 0⟩ ⟨1, 1⟩

where each tuple ⟨x, y⟩ gives the values of X and Y .


We can also define the total number of heads as Z = X + Y . If we label
all the points ω in our probability space with Z(ω), we get a picture that
looks like this:
0 1
1 2
This is what we see if we know the exact value of both coin-flips (or at
least the exact value of Z).
7
If Y is not discrete, the situation is more complicated. See [Fel71, §§III.2 and V.9–V.11]
or [GS92, §7.9].

But now suppose we only know X, and want to compute E [Z | X]. When X = 0, E [Z | X = 0] = (1/2) · 0 + (1/2) · 1 = 1/2; and when X = 1, E [Z | X = 1] = (1/2) · 1 + (1/2) · 2 = 3/2. So drawing E [Z | X] over our probability space gives

    1/2 1/2
    3/2 3/2

We’ve averaged the value of Z across each row, since each row corresponds
to one of the possible values of X.
If instead we compute E [Z | Y ], we get this picture instead:
    1/2 3/2
    1/2 3/2

Now instead of averaging across rows (values of X) we average across


columns (values of Y ). So the left column shows E [Z | Y = 0] = 1/2 and the
right column shows E [Z | Y = 1] = 3/2, which is pretty much what we’d
expect.
Nothing says that we can only condition on X and Y . What happens if
we condition on Z?
Now we are going to get fixed values for each possible value of Z. If we
compute E [X | Z], then when Z = 0 this will be 0 (because Z = 0 implies
X = 0), and when Z = 2 this will be 1 (because Z = 2 implies X = 1). The
middle case is E [X | Z = 1] = (1/2) · 0 + (1/2) · 1 = 1/2, because the two outcomes ⟨0, 1⟩ and ⟨1, 0⟩ that give Z = 1 are equally likely. The picture is

    0    1/2
    1/2  1
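These little pictures can also be computed mechanically: conditioning on a random variable just means averaging over the outcomes that agree with it. A small sketch over the four-point space (all outcomes equally likely):

    from fractions import Fraction

    # the four equally likely outcomes (x, y) of two independent fair coins
    omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
    Z = lambda w: w[0] + w[1]
    X = lambda w: w[0]

    def cond_exp(f, g, w0):
        # E[f | g = g(w0)]: average f over the outcomes that agree with w0 on g
        block = [w for w in omega if g(w) == g(w0)]
        return Fraction(sum(f(w) for w in block), len(block))

    print([str(cond_exp(Z, X, w)) for w in omega])   # E[Z|X]: 1/2, 1/2, 3/2, 3/2
    print([str(cond_exp(X, Z, w)) for w in omega])   # E[X|Z]: 0, 1/2, 1/2, 1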

3.5.2.1 Calculating conditional expectations


Usually we will not try to compute E [X | Y ] individually for each possible ω or even
each possible value y of Y . Instead, we can use various basic facts to compute
E [X | Y ] by applying arithmetic to random variables.
The two basic facts to start with are:

1. If X is a function of Y , then E [X | Y ] = X. Proof: Suppose X =


f (Y ). From (3.5.4), for each outcome ω, we have E [X | Y ] (ω) =
E [X | Y = Y (ω)] = E [f (Y (ω)) | Y = Y (ω)] = f (Y (ω)) = X(ω).

2. If X is independent of Y , then E [X | Y ] = E [X]. Proof: Now for each


ω, we have E [X | Y ] (ω) = E [X | Y = Y (ω)] = E [X].

We also have a rather strong version of linearity of expectation. If A and


B are both functions of Z, then
E [AX + BY | Z] = A E [X | Z] + B E [Y | Z] . (3.5.5)
Here is a proof for discrete probability spaces. For each value z of Z, we
have
    E [A(Z)X + B(Z)Y | Z = z] = ∑_{ω∈Z^{−1}(z)} (A(z)X(ω) + B(z)Y (ω)) · Pr [ω] / Pr [Z = z]
                              = A(z) ∑_{ω∈Z^{−1}(z)} X(ω) Pr [ω] / Pr [Z = z]
                                + B(z) ∑_{ω∈Z^{−1}(z)} Y (ω) Pr [ω] / Pr [Z = z]
                              = A(z) E [X | Z = z] + B(z) E [Y | Z = z],
which is the value when Z = z of A E [X | Z] + B E [Y | Z].
This means that we can quickly simplify many conditional expectations.
If we go back to the example of the previous section, where Z = X + Y is
the sum of two independent fair coin-flips X and Y , then we can compute
    E [Z | X] = E [X + Y | X]
              = E [X | X] + E [Y | X]
              = X + E [Y ]
              = X + 1/2.
Similarly, E [Z | Y ] = E [X + Y | Y ] = 1/2 + Y .
In some cases we have enough additional information to run this in
reverse. If we know Z and want to estimate X, we can use the fact that
X and Y are symmetric to argue that E [X | Z] = E [Y | Z]. But then
Z = E [X + Y | Z] = 2 E [X | Z], so E [X | Z] = Z/2. Note that this works
in general only if the events [X = a, Y = b] and [X = b, Y = a] have the
same probabilities for all a and b even if we condition on Z, which in this
case follows from the fact that X and Y are independent and identically
distributed and that addition is commutative. Other cases may be messier.8
Other facts, like X ≥ Y implies E [X | Z] ≥ E [Y | Z], can be proved
using similar techniques.
8
An example with X and Y identically distributed but not independent is to imagine
that we roll a six-sided die to get X, and let Y = X + 1 if X < 6 and Y = 1 if X = 6. Now
knowing Z = X + Y = 3 tells me that X = 1 and Y = 2 exactly, neither of which is Z/2.

3.5.2.2 The law of iterated expectation


The law of iterated expectation says that

E [X] = E [E [X | Y ]] . (3.5.6)

When Y is discrete, this is just (3.5.3) in disguise. For each of the


(countably many) values of Y that occur with nonzero probability, let Ay be the event [Y = y]. Then these events are a countable partition of Ω, and

    E [E [X | Y ]] = ∑_y Pr [Y = y] E [E [X | Y ] | Y = y]
                   = ∑_y Pr [Y = y] E [X | Y = y]
                   = E [X].

The trick here is that we use (3.5.3) to expand out the original expression in
terms of the events Ay , then notice that E [X | Y ] is equal to E [X | Y = y]
whenever Y = y.
So as claimed, conditioning on a variable gives a way to write averaging
over cases very compactly.
It’s also not too hard to show that iterated expectation works with partial
conditioning:
E [E [X | Y, Z] | Y ] = E [X | Y ] . (3.5.7)

3.5.2.3 Conditional expectation as orthogonal projection


If you are comfortable with linear algebra, it may be helpful to think about
expectation conditioned on a random variable as a form of projection onto
a subspace. In this section, we’ll give a brief description of how this works
for a finite, discrete probability space. For a more general version, see for
example [GS92, §7.9].
Consider the set of all real-valued random variables on the probability
space {TT, TH, HT, HH} corresponding to flipping two independent fair coins.
We can think of each such random variable as a vector in R^4, where the
four coordinates give the value of the variable on the four possible outcomes.
For example, the indicator variable for the event that the first coin is heads
would look like X = h0, 0, 1, 1i and the indicator variable for the event that
the second coin is heads would look like Y = h0, 1, 0, 1i.
When we add two random variables together, we get a new random vari-
able. This corresponds to vector addition: X + Y = h0, 0, 1, 1i + h0, 1, 0, 1i =

h0, 1, 1, 2i. Multiplying a random variable by a constant looks like scalar


multiplication 2X = 2 · h0, 0, 1, 1i = h0, 0, 2, 2i. Because random variables
support both addition and scalar multiplication, and because these operations
obey the axioms of a vector space, we can treat the set of all real-valued
random variables defined on a given probability space as a vector space, and
apply all the usual tools from linear algebra to this vector space.
One thing in particular we can look at is subspaces of this vector space.
Consider the set of all random variables that are functions of X. These are
vectors of the form ha, a, b, bi, and adding any two such vectors or multiplying
any such vector by a scalar yields another vector of this form. So the functions
of X form a two-dimensional subspace of the four-dimensional space of all
random variables. An even lower-dimensional subspace is the one-dimensional
subspace of constants: vectors of the form ha, a, a, ai. As with functions of
X, this set is closed under addition and multiplication by a constant.
When we take the expectation of X, we are looking for a constant
that gives us the average value of X. In vector terms, this means that
E [h0, 0, 1, 1i] = h1/2, 1/2, 1/2, 1/2i. This expectation vector is in fact the
orthogonal projection of h0, 0, 1, 1i onto the subspace generated by 1 =
h1, 1, 1, 1i; we can tell this because the dot-product of X − E [X] with 1
is h−1/2, −1/2, 1/2, 1/2i · h1, 1, 1, 1i = 0. If instead we take a conditional
expectation, we are again doing an orthogonal projection, but now onto a
higher-dimensional subspace. So E [X + Y | X] = h1/2, 1/2, 3/2, 3/2i is the
orthogonal projection of X + Y = h0, 1, 1, 2i onto the space of all functions
of X, which is generated by the basis vectors h0, 0, 1, 1i and h1, 1, 0, 0i. As
in the simple expectation, the dot-product of (X + Y ) − E [X + Y | X] with
either of these basis vectors is 0.
Many facts about conditional expectation translate in a straightforward
way to facts about projections. Linearity of expectation is equivalent to
linearity of projection onto a subspace. The law of iterated expectation
E [E [X | Y ]] = E [X] says that projecting onto the subspace of functions of
Y and then onto the subspace of constants is equivalent to projection directly
down to the subspace of constants; this is true in general for projection
operations. It’s also possible to represent other features of probability spaces
in terms of expectations; for example, E [XY ] acts like an inner product
for random variables, E X 2 acts like the square of the Euclidean distance,
 

and the fact that E [X] is an orthogonal projection of X means that E [X]
is precisely the constant value µ that minimizes the distance E (X − µ)2 .
We won’t actually use any of these facts in the following, but having another
way to look at conditional expectation may be helpful in understanding how
it works.
CHAPTER 3. RANDOM VARIABLES 43

3.5.3 Expectation conditioned on a σ-algebra


Expectation conditioned on a random variable is actually a special case of
the expectation of X conditioned on a σ-algebra F. Recall that a σ-algebra is
a family of subsets of Ω that includes Ω and is closed under complement and
countable union; for discrete probability spaces, this turns out to be the set
of all unions of equivalence classes for some equivalence relation on Ω,9 and
we think of F as representing knowledge of which equivalence class we are in,
but not which point in the equivalence class we land on. An example would
be if Ω consists of all values (X1 , X2 ) obtained from two die rolls, and F
consists of all sets A such that whenever one point ω with X1 (ω) + X2 (ω) = s
is in A, so is every other point ω 0 with X1 (ω 0 ) + X2 (ω 0 ) = s. (This is the
σ-algebra generated by the random variable X1 + X2 .)
A discrete random variable X is measurable with respect to F, or
F-measurable, if every event [X = x] is contained in F; in other words,
knowing only where we are in F, we can compute exactly the value of X.
This gives a formal way to define σ(X): it is the smallest σ-algebra F such
that X is F-measurable.
If X is not F-measurable, the best approximation we can make to it given
that we only know where we are in F is E [X | F], which is defined as a random
variable Q that is (a) F-measurable; and (b) satisfies E [Q | A] = E [X | A]
for any event A ∈ F with Pr [A] 6= 0.
For discrete probability spaces, this just means that we replace X with
its average value across each equivalence class: property (a) is satisfied
because E [X | F] is constant across each equivalence class, meaning that
[E [X | F] = x] is a union of equivalence classes, and property (b) is satisfied
because we define E [E [X | F] | A] = E [X | A] for each equivalence class A,
and the same holds for unions of equivalence classes by a simple calculation.
This gives the same result as E [X | Y ] if F is generated by Y , or more
generally as E [X | Y1 , Y2 , . . .] if F is generated by Y1 , Y2 , . . . . In each case
the intuition is that we are getting the best estimate we can for X given the
information we have. It is also possible to define E [X | F] as a projection
onto the subspace of all random variables that are F-measurable, analogously
to the special case for E [X | Y ] described in §3.5.2.3.
9
Proof: Let F be a σ-algebra over a countable set Ω. Let ω ∼ ω 0 if, for all A in F,
ω ∈ A if and only if ω 0 ∈ A; this is an equivalence relation on Ω. To show that the
equivalence classes of ∼ are elementsTof F, for each ω 00 6∼ ω, let Aω00 be some element of F
that contains ω but not ω 00 . Then ω00 Aω00 (a countable intersection of elements of F)
contains ω and all points ω 0 ∼ ω but no points ω 00 6∼ ω; in other words, it’s the equivalence
class of ω. Since there are only countably many such equivalence classes, we can construct
all the elements of F by taking all possible unions of them.
CHAPTER 3. RANDOM VARIABLES 44

Sometimes it is convenient to use more than one σ-algebra to represent


increasing knowledge over time. A filtration is a sequence of σ-algebras
F0 ⊆ F1 ⊆ F2 ⊆ . . . , where each Ft represents the information we have
available at time t. That each Ft is a subset of Ft+1 means that any event
we can determine at time t we can also determine at all future times t0 > t:
though we may learn more information over time, we never forget what
we already know. A common example of a filtration is when we have a
sequence of random variables X1 , X2 , . . . , and define Ft as the σ-algebra
hX1 , X2 , . . . , Xt i generated by X1 , X2 , . . . , Xt .
When one σ-algebra is a subset of another, a version of the law of iterated
expectation applies: F ∈ F 0 implies E [E [X | F 0 ] | F] = E [X | F]. One way
to think about this is that if we forget everything about X we can’t predict
from F 0 and then forget everything that’s left that we can’t predict from
F, we get to the same place as if we just forget everything except F to
begin with. The simplest version E [X | F 0 ] = E [X] is just what happens
when F is the trivial σ-algebra {∅, Ω}, where all we know is that something
happened, but we don’t know what.

3.5.4 Examples
• Let X be the value of a six-sided die. Let A be the event that X is
even. Then
X
E [X | A] = x Pr [X = x | A]
x
1
= (2 + 4 + 6) ·
3
= 4.

• Let X and Y be independent six-sided dice, and let Z = X + Y . Then


E [Z | X] is a random variable whose value is 1 + 7/2 when X = 1,
2 + 7/2 when X = 2, etc. We can write this succinctly by writing
E [Z | X] = X + 7/2.

• Conversely, if X, Y , and Z are as above, we can also compute E [X] Z.


Here we are told what Z is and must make an estimate of X.
For some values of Z, this nails down X completely: E [X | Z = 2] = 1
because X you can only make 2 in this model as 1 + 1. For other values,
we don’t know much about X, but can still compute the expectation.
For example, to compute E [X | Z = 5], we have to average X over all
the pairs (X, Y ) that sum to 5. This gives E [X | Z = 5] = 14 (1 + 2 + 3 +
CHAPTER 3. RANDOM VARIABLES 45

4) = 52 . (This is not terribly surprising, since by symmetry E [Y | Z = 5]


should equal E [X | Z = 5], and since conditional expectations add
just like regular expectations, we’d expect that the sum of these two
expectations would be 5.)
The actual random variable E [X | Z] summarizes these conditional
expectations for all events of the form [Z = z]. Because of the symmetry
argument above, we can write it succinctly as E [X | Z] = Z2 . Or we
could list its value for every ω in our underlying probability space, as
in done in Table 3.2 for this and various other conditional expectations
on the two-independent-dice space.

3.6 Applications
3.6.1 Yao’s lemma
In Section 1.1, we considered a special case of the unordered search problem,
where we have an unordered array A[1..n] and want to find the location
of a specific element x. For deterministic algorithms, this requires probing
n array locations in the worst case, because the adversary can place x in
the last place we look. Using a randomized algorithm, we can reduce this
to (n + 1)/2 probes on average, either by probing according to a uniform
random permutation or just by probing from left-to-right or right-to-left
with equal probability.
Can we do better? Proving lower bounds is a nuisance even for determin-
istic algorithms, and for randomized algorithms we have even more to keep
track of. But there is a sneaky trick that allows us to reduce randomized
lower bounds to deterministic lower bounds in many cases.
The idea is that if we have a randomized algorithm that runs in time
T (x, r) on input x with random bits r, then for any fixed choice of r we
have a deterministic algorithm. So for each n, we find some random X
with |X| = n and show that, for any deterministic algorithm that runs in
time T 0 (x), E [T 0 (X)] ≥ f (n). But then E [T (X, R)] = E [E [T (X, R) | R]] =
E [E [TR (X) | R]] ≥ f (n).
This gives us Yao’s lemma:
Lemma 3.6.1 (Yao’s lemma (informal version)[Yao77]). Fix some problem.
Suppose there is a random distribution on inputs X of size n such that every
deterministic algorithm for the problem has expected cost T (n).
Then the worst-case expected cost of any randomized algorithm is at least
T (n).
CHAPTER 3. RANDOM VARIABLES 46

ω X Y Z =X +Y E [X] E [X | Y = 3] E [X | Y ] E [Z | X] E [X | Z] E [X | X]
(1, 1) 1 1 2 7/2 7/2 7/2 1 + 7/2 2/2 1
(1, 2) 1 2 3 7/2 7/2 7/2 1 + 7/2 3/2 1
(1, 3) 1 3 4 7/2 7/2 7/2 1 + 7/2 4/2 1
(1, 4) 1 4 5 7/2 7/2 7/2 1 + 7/2 5/2 1
(1, 5) 1 5 6 7/2 7/2 7/2 1 + 7/2 6/2 1
(1, 6) 1 6 7 7/2 7/2 7/2 1 + 7/2 7/2 1
(2, 1) 2 1 3 7/2 7/2 7/2 2 + 7/2 3/2 2
(2, 2) 2 2 4 7/2 7/2 7/2 2 + 7/2 4/2 2
(2, 3) 2 3 5 7/2 7/2 7/2 2 + 7/2 5/2 2
(2, 4) 2 4 6 7/2 7/2 7/2 2 + 7/2 6/2 2
(2, 5) 2 5 7 7/2 7/2 7/2 2 + 7/2 7/2 2
(2, 6) 2 6 8 7/2 7/2 7/2 2 + 7/2 8/2 2
(3, 1) 3 1 4 7/2 7/2 7/2 3 + 7/2 4/2 3
(3, 2) 3 2 5 7/2 7/2 7/2 3 + 7/2 5/2 3
(3, 3) 3 3 6 7/2 7/2 7/2 3 + 7/2 6/2 3
(3, 4) 3 4 7 7/2 7/2 7/2 3 + 7/2 7/2 3
(3, 5) 3 5 8 7/2 7/2 7/2 3 + 7/2 8/2 3
(3, 6) 3 6 9 7/2 7/2 7/2 3 + 7/2 9/2 3
(4, 1) 4 1 5 7/2 7/2 7/2 4 + 7/2 5/2 4
(4, 2) 4 2 6 7/2 7/2 7/2 4 + 7/2 6/2 4
(4, 3) 4 3 7 7/2 7/2 7/2 4 + 7/2 7/2 4
(4, 4) 4 4 8 7/2 7/2 7/2 4 + 7/2 8/2 4
(4, 5) 4 5 9 7/2 7/2 7/2 4 + 7/2 9/2 4
(4, 6) 4 6 10 7/2 7/2 7/2 4 + 7/2 10/2 4
(5, 1) 5 1 6 7/2 7/2 7/2 5 + 7/2 6/2 5
(5, 2) 5 2 7 7/2 7/2 7/2 5 + 7/2 7/2 5
(5, 3) 5 3 8 7/2 7/2 7/2 5 + 7/2 8/2 5
(5, 4) 5 4 9 7/2 7/2 7/2 5 + 7/2 9/2 5
(5, 5) 5 5 10 7/2 7/2 7/2 5 + 7/2 10/2 5
(5, 6) 5 6 11 7/2 7/2 7/2 5 + 7/2 11/2 5
(6, 1) 6 1 7 7/2 7/2 7/2 6 + 7/2 7/2 6
(6, 2) 6 2 8 7/2 7/2 7/2 6 + 7/2 8/2 6
(6, 3) 6 3 9 7/2 7/2 7/2 6 + 7/2 9/2 6
(6, 4) 6 4 10 7/2 7/2 7/2 6 + 7/2 10/2 6
(6, 5) 6 5 11 7/2 7/2 7/2 6 + 7/2 11/2 6
(6, 6) 6 6 12 7/2 7/2 7/2 6 + 7/2 12/2 6

Table 3.2: Various conditional expectations on two independent dice


CHAPTER 3. RANDOM VARIABLES 47

For unordered search, putting x in a uniform random array location


makes any deterministic algorithm take at least (n + 1)/2 probes on average.
So randomized algorithms take at least (n + 1)/2 probes as well.

3.6.2 Geometric random variables


Suppose that we are running a Las Vegas algorithm that takes a fixed
amount of time T , but succeeds only with probability p (which we take to
be independent of the outcome of any other run of the algorithm). If the
algorithm fails, we run it again. How long does it take on average to get the
algorithm to work?
We can reduce the problem to computing E [T X] = T E [X], where X
is the number of times the algorithm runs. The probability that X = n is
exactly (1−p)n−1 p, because we need to get n−1 failures with probability 1−p
each followed by a single success with probability p, and by assumption all of
these probabilities are independent. A variable with this kind of distribution
is called a geometric random variable. We saw a special case of this
distribution earlier (§2.3.3.1) when we were looking at how many flips it
would take on average to get the first tails from a fair coin coin (in that case,
p was 1/2).
Using conditional expectation, it’s straightforward to compute E [X]. Let
A be the event that the algorithm succeeds on the first run, i.e., then event
[X = 1]. Then
h i h i
E [X] = E [X | A] Pr [A] + E X Ā Pr Ā
h i
= 1 · p + E X Ā · (1 − p).
h i
The tricky part here is to evaluate E X Ā . Intuitively, if we don’t succeed
the first time, we’ve wasted
h onei step and are back where we started, so it
should be the case that E X Ā = 1 + E [X]. If we want to be really careful,
CHAPTER 3. RANDOM VARIABLES 48

we can calculate this out formally (no sensible person would ever do this):
h i ∞
X
E X Ā = n Pr [X = n | X 6= 1]
n=1

X Pr [X = n]
= n
n=2
Pr [X 6= 1]

X (1 − p)n−1 p
= n
n=2
1−p
X∞
= n(1 − p)n−2 p
n=2
X∞
= (n + 1)(1 − p)n−1 p
n=1

X
=1+ n(1 − p)n−1 p
n=1
= 1 + E [X] .

Since we know that E [X] = p + (1 + E [X])(1 − p), a bit of algebra gives


E [X] = 1/p, which is about what we’d expect.
There are more direct ways to get the same result. If we don’t have
conditional expectation to work with, we can try computing the sum E [X] =
P∞ n−1 p directly. The easiest way to do this is probably to use
n=1 n(1 − p)
generating functions (see, for example, [GKP88, Chapter 7] or [Wil06]).
An alternative argument is given in [MU17, §2.4]; this uses the fact that
E [X] = ∞ n=1 Pr [X ≥ n], which holds when X takes on only non-negative
P

integer values.

3.6.3 Coupon collector


In the coupon collector problem, we throw balls uniformly and indepen-
dently into n bins until every bin has at least one ball. When this happens,
how many balls have we used on average?10
Let Xi be the number of balls needed to go from i − 1 nonempty bins
to i nonempty bins. It’s easy to see that X1 = 1 always. For larger i, each
time we throw a ball, it lands in an empty bin with probability n−i+1
n . This
10
The name comes from the problem of collecting coupons at random until you have all
of them. A typical algorithmic application is having a cluster of machines choose jobs to
finish at random from some list until all are done. The expected number of job executions
to complete n jobs is given exactly by the solution to the coupon collector problem.
CHAPTER 3. RANDOM VARIABLES 49

n−i+1
means that Xi has a geometric distribution with probability n , giving
n
E [Xi ] = n−i+1 from the analysis in §3.6.2.
To get the total expected number of balls, take the sum
" n # n
X X
E Xi = E [Xi ]
i=1 i=1
n
X n
=
i=1
n−i+1
n
X 1
=n
i=1
i
= nHn .

In asymptotic terms, this is Θ(n log n).

3.6.4 Hoare’s FIND


Hoare’s FIND [Hoa61b], often called QuickSelect, is an algorithm for
finding the k-th smallest element of an unsorted array that works like
QuickSort, only after partitioning the array around a random pivot we
throw away the part that doesn’t contain our target and recurse only on
the surviving piece. As with QuickSort, we’d like to compute the expected
number of comparisons used by this algorithm, on the assumption that the
cost of the comparisons dominates the rest of the costs of the algorithm.
Here the indicator-variable trick gets painful fast. It turns out to be
easier to get an upper bound by computing the expected number of elements
that are left after each split.
First, let’s analyze the pivot step. If the pivot is chosen uniformly, the
number of elements X smaller than the pivot is uniformly distributed in the
range 0 to n−1. The number of elements larger than the pivot will be n−X−1.
In the worst case, we find ourselves recursing on the large pile always, giving
a bound on the number of survivors Y of Y ≤ max(X, n − X + 1).
What is the expected value of Y ? By considering both ways the max can
go, we get

E [Y ] = E [X | X > n − X + 1] Pr [X > n − X + 1]
+ E [n − X + 1 | n − X + 1 ≥ X] Pr [n − X + 1 ≥ X] .

For both conditional expectations, we are choosing a value uniformly in


either the range d n−1 n−1
2 e to n − 1 or d 2 e + 1 to n − 1, and in either case the
CHAPTER 3. RANDOM VARIABLES 50

expectation will be equal to the average of the two endpoints by symmetry.


So we get

n/2 + n − 1 n/2 + n
E [Y ] ≤ Pr [X > n − X + 1] + Pr [n − X + 1 ≥ X]
2  2
3 1 3

= n− Pr [X > n − X + 1] + n Pr [n − X + 1 ≥ X]
4 2 4
3
≤ n.
4
Now let Xi be the number of survivors after i pivot steps. Note that
max(0, Xi − 1) gives the number of comparisons at the following pivot step,
so that ∞
P
i=0 Xi is an upper bound on the number of comparisons.
We have X0 = n, and from the preceding argument E [X1 ] ≤ (3/4)n. But
more generally, we can use the same argument to show that E [Xi+1 | Xi ] ≤
(3/4)Xi , and by induction E [Xi ] ≤ (3/4)i n. We also have that Xj = 0 for
all j ≥ n, because we lose at least one element (the pivot) at each pivoting
step. This saves us from having to deal with an infinite sum.
Using linearity of expectation,
"∞ # " n #
X X
E Xi = E Xi
i=0 i=0
n
X
= E [Xi ]
i=0
Xn
≤ (3/4)i n
i=0
≤ 4n.
Chapter 4

Basic probabilistic
inequalities

Here we’re going to look at some inequalities useful for proving properties
of randomized algorithms. These come in two flavors: inequalities involving
probabilities, which are useful for bounding the probability that something
bad happens, and inequalities involving expectations, which are used to
bound expected running times. Later, in Chapter 5, we’ll be doing both,
by looking at inequalities that show that a random variable is close to its
expectation with high probability.1

4.1 Markov’s inequality


This is the key tool for turning expectations of non-negative random variables
into (upper) bounds on probabilities. Used directly, it generally doesn’t give
very good bounds, but it can work well if we apply it to E [f (X)] for a
fast-growing function f ; for some examples, see Chebyshev’s inequality (§5.1)
or Chernoff bounds (§5.2).
1
Often, the phrase with high probability is used in algorithm analysis to mean
specifically with probability at least 1 − n−c for any fixed c. The idea is that if an algorithm
works with high probability in this sense, then the probability that it fails each time you
run it is at most n−c , which means that if you run it as a subroutine in a polynomial-
0
time algorithm that calls is at most nc times, the total probability of failure is at most
0 0
nc n−c = nc −c by the union bound. Assuming we can pick c to be much larger than c0 ,
this makes the outer algorithm also work with high probability.

51
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 52

Markov’s inequality says that if X ≥ 0 and α > 0, then

E [X]
Pr [X ≥ α] ≤ .
α
The proof is immediate from the law of total probability (2.3.3). We have

E [X] = E [X | X ≥ α] Pr [X ≥ α] + E [X | x < α] Pr [X < α]


≥ α · Pr [X ≥ α] + 0 · Pr [X < α]
= α · Pr [X ≥ α] ;

now solve for Pr [X ≥ α].


Markov’s inequality doesn’t work in reverse. For example, consider the
following game: for each integer k > 0, with probability 2−k , I hgive youi 2k
dollars. Letting X be your payoff from the game, we have Pr X ≥ 2k =
P∞
2−k = 2−k ∞ −` = 2 . The right-hand side here is exactly what we
P
j=k `=0 2 2k
would get from Markov’s inequality if E [X] = 2. But in this case, E [X] 6= 2;
in fact, the expectation of X is given by ∞ k −k
P
k=1 2 2 , which diverges.

4.1.1 Applications
4.1.1.1 Sum of fair coins
Flip n independent fair coins, and let S be the number of heads we get. Since
E [S] = n/2, we get Pr [S = n] = 1/2. This is much larger than the actual
value 2−n , but it’s the best we can hope for if we only know E [S]: if we let
S be 0 or n with equal probability, we also get E [S] = n/2.

4.1.1.2 Randomized QuickSort


The expected running time for randomized QuickSort is O(n log n). It follows
that the probability that randomized QuickSort takes more than f (n) time is
O(n log n/f (n)). For example, the probability that it performs the maximum
n
2 = O(n2 ) comparisons is O(log n/n). (It’s possible to do much better
than this.)

4.1.1.3 Balls in bins


Suppose we toss m balls in n bins, uniformly and independently. What
is the probability that some particular bin contains at least k balls? The
probability that a particular ball lands in a particular bin is 1/n, so the
expected number of balls in the bin is m/n. This gives a bound of m/nk
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 53

that a particular bin contains k or more balls. Unfortunately this is not a


very good bound.

4.2 Union bound (Boole’s inequality)


The union bound or Boole’s inequality says that for any countable
collection of events {Ai },
h[ i X
Pr Ai ≤ Pr [Ai ] . (4.2.1)

Combining Markov’s inequality with linearity of expectation and indicator


variables gives a succinct proof of the union bound:
h[ i hX i
Pr Ai = Pr 1Ai ≥ 1
hX i
≤E 1Ai
X
= E [1Ai ]
X
= Pr [Ai ] .

Note that for this to work for infinitely many events we need to use the fact
that 1Ai is non-negative.
If we prefer to avoid any issues with infinite sums of expectations, the
direct way to prove this is to replace Ai with Bi = Ai \ i−1
S
S S j=1 Ai . Then
Ai = Bi , but since the Bi are disjoint and each Bi is a subset of the
corresponding Ai , we get Pr [ Ai ] = Pr [ Bi ] = Pr [Bi ] ≤ Pr [Ai ].
S S P P

The typical use of the union bound is to show that if an algorithm can
fail only if various improbable events occur, then the probability of failure is
no greater than the sum of the probabilities of these events. This reduces
the problem of showing that an algorithm works with probability 1 −  to
constructing an error budget that divides the  probability of failure among
all the bad outcomes.

4.2.1 Example: Balls in bins


Suppose we toss n balls uniformly and independently into n bins. What
high-probability bound can we get on the maximum number of balls in any
one bin?2
2
Algorithmic version: we insert n elements into a hash table with n positions using a
random hash function. What is the maximum number of elements in any one position?
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 54

Consider all nk sets S of k balls. If we get at least k balls in some bin,




then one of these sets must all land in the same bin. Call the event that all
balls in S choose the same bin AS . The probability that AS occurs is exactly
n−k+1 .
Using the union bound, we get
" #
[
Pr [some bin gets at least k balls] = Pr AS
S
X
≤ Pr [AS ]
S
!
n −k+1
= n
k
nk −k+1
≤ n
k!
n
= .
k!
If we want this probability to√ be low, we should choose k so that k!  n.
Stirling’s formula says that k! ≥ 2πk(k/e)k ≥ (k/e)k , which gives ln(k!) ≥
k(ln k − 1). If we set k = c ln n/ ln ln n, we get
c ln n
ln(k!) ≥ (ln c + ln ln n − ln ln ln n − 1)
ln ln n
≥ c ln n.

when n is sufficiently large.


It follows that the bound n/k! in this case is less than n/ exp(c ln n) =
n · n−c = n1−c . For suitable choice of c we get a high probability that every
bin gets at most O(log n/ log log n) balls.

4.3 Jensen’s inequality


This is mostly useful if we can calculate E [X] easily for some X, but what
we really care about is some other random variable Y = f (X).
Jensen’s inequality applies when f is a convex function, which means
that for any x, y, and 0 ≤ µ ≤ 1, f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).
Geometrically, this means that the line segment between any two points on the
graph of f never goes below f ; i.e., that the set of points {(x, y) | y ≥ f (x)}
is convex. If we want to show that a continuous function f is convex, it’s
enough to show that that f 2 ≤ f (x)+f
x+y
2
(y)
for all x and y (in effect, we
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 55

only need to prove it for the case λ = 1/2). If f is twice-differentiable, an


even easier way is to show that f 00 (x) ≥ 0 for all x.
The inequality says that if X is a random variable and f is convex then

f (E [X]) ≤ E [f (X)] . (4.3.1)

Alternatively, if f is concave (which means that f (λx + (1 − λ)y) ≥


λf (x) + (1 − λ)f (y), or equivalently that −f is convex), the reverse inequality
holds:

f (E [X]) ≥ E [f (X)] . (4.3.2)

In both cases, the direction of Jensen’s inequality matches the direction


of the inequality in the definition of convexity or concavity. This is not
surprising because convexity or or concavity is just Jensen’s inequality for
the random variable X that equals x with probability λ and y with probability
1−λ. Jensen’s inequality just says that this continues to work for any random
variable for which the expectations exist.

4.3.1 Proof
Here is a proof for the case that f is convex and differentiable. The idea is
that if f is convex, then it lies above the tangent line at E [X]. So we can
define a linear function g that represents this tangent line, and get, for all x:

f (x) ≥ g(x) = f (E [X]) + (x − E [X])f 0 (E [X]). (4.3.3)

But then

E [f (X)] ≥ E [g(X)]
= E f (E [X]) + (X − E [X])f 0 (E [X])
 

= f (E [X]) + E [(X − E [X])] f 0 (E [X])


= f (E [X]) + (E [X] − E [X])f 0 (E [X])
= f (E [X]).

Figure 4.1 shows what this looks like for a particular convex f .
This is pretty much all linearity of expectation in action: E [X], f (E [X]),
and f 0 (E [X]) are all constants, so we can pull them out of expectations
whenever we see them.
The proof of the general case is similar, but for a non-differentiable convex
function it takes a bit more work to show that the bounding linear function
g exists.
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 56

f (x)

E [f (X)]
g(x)

f (E [X])

E [X]

Figure 4.1: Proof of Jensen’s inequality

4.3.2 Applications
4.3.2.1 Fair coins: lower bound
Suppose we flip n fair coins, and we want to get a lower bound on E X 2 ,
 

where X is the number of heads. The function f : x 7→ x2 is convex (take its


2
second derivative), so (4.3.1) gives E X 2 ≥ (E [X])2 = n4 .
 
2
The actual value for E X 2 is n4 + n4 , which can be found directly using
 

generating functions3 or less directly using variance, which we will encounter


in §5.1. This is pretty close to the lower bound we got out of Jensen’s
inequality, but we can’t count on this happening in general.

4.3.2.2 Fair coins: upper bound


For an upper bound, we can choose a concave f . For example, if X is as in
the previous example, E [lg X] ≤ lg E [X] = lg n2 = lg n − 1. This is probably
pretty close to the exact value, as we will see later that X will almost always
be within a factor of 1 + o(1) of n/2. It’s not a terribly useful upper bound,
because if we use it with (say) Markov’s inequality, the most we can prove
is that Pr [X = n] = Pr [lg X = lg n] ≤ lglgn−1 1
n = 1 − lg n , which is an even
 
Here’s how: The probability generating function for X is F (z) = E z k =
3

z Pr [X = k] = 2−n (1 + z)n . Then zF 0 (z) = 2−n nz(1 + z)n−1 =


P k
kz k Pr [X = k].
P
k
−n n−1 −n
Taking the
Pderivative of this a second time gives 2 n(1 + z) + 2 n(n −1)z(1  +
z)n−2 = k
k 2 k−1
z Pr [X = k]. Evaluate this monstrosity at z = 1 to get E X 2
=
2n+n2 −n
k2 Pr [X = k] = 2−n n2n−1 + 2−n n(n − 1)2n−2 = n/2 + n(n − 1)/4 =
P
k 4
=
2
n /4 + n/4.
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 57

worse bound than the 1/2 we can get from applying Markov’s inequality to
X directly.

4.3.2.3 Sifters
Here’s an example of Jensen’s inequality in action in the analysis of an actual
distributed algorithm. For some problems in distributed computing, it’s
useful to reduce coordinating a large number of processes to coordinating
a smaller number. A sifter [AA11] is a randomized mechanism for an
asynchronous shared-memory system that sorts the processes into “winners”
and “losers,” guaranteeing that there is at least one winner. The goal is to
make the expected number of winners as small as possible. The problem
is tricky, because processes can only communicate by reading and writing
shared variables, and an adversary gets to choose which processes participate
and fix the schedule of when each of these processes perform their operations.
The current best known sifter is due to Giakkoupis and Woelfel [GW12].
For n processes, it uses an array A of dlg ne bits, each of which can be read
or written by any of the processes. When a process executes the sifter, it
chooses a random index r ∈ 1 . . . dlg ne with probability 2−r−1 (this doesn’t
exactly sum to 1, so the excess probability gets added to r = dlg ne). The
process then writes a 1 to A[r] and reads A[r + 1]. If it sees a 0 in its read
(or chooses r = dlg ne), it wins; otherwise it loses.
This works as a sifter, because no matter how many processes participate,
some process chooses a value of r at least as large as any other process’s
value, and this process wins. To bound the expected number of winners, take
the sum over all r over the random variable Wr representing the winners
who chose this particular value r. A process that chooses r wins if it carries
out its read operation before any process writes r + 1. If the adversary wants
to maximize the number of winners, it should let each process read as soon
as possible; this effectively means that a process that choose r wins if no
process previously chooses r + 1. Since r is twice as likely to be chosen as
r + 1, conditioning on a process picking r or r + 1, there is only a 1/3 chance
that it chooses r + 1. So at most 1/(1/3) − 1 = 2 = O(1) process on average
choose r before some process chooses r + 1. (A simpler argument shows that
the expected number of processes that win because they choose r = dlg ne is
at most 2 as well.)
Summing Wr ≤ 2 over all r gives at most 2dlg ne winners on average.
Furthermore, if k < n processes participate, essentially the same analysis
shows that only 2dlg ke processes win on average. So this is a pretty effective
tool for getting rid of excess processes.
CHAPTER 4. BASIC PROBABILISTIC INEQUALITIES 58

But it gets better. Suppose that we take the winners of one sifter and
feed them into a second sifter. Let Xk be the number of processes left after
k sifters. We have that X0 = n and E [X1 ] ≤ 2dlg ne, but what can we
say about E [X2 ]? We can calculate E [X2 ] = E [E [X2 | X1 ]] ≤ 2dlg X1 e.
Unfortunately, the ceiling means that 2dlg xe is not a concave function,
but f (x) = 2(lg x + 1) ≥ 2dlg xe is. So E [X2 ] ≤ f (f (n)), and in general
E [Xi ] ≤ f (i) (n), where f (i) is the i-fold composition of f . All the extra
constants obscure what is going on a bit, but with a little bit of algebra it is
not too hard to show that f (i) (n) = O(1) for i = O(log∗ n).4 So this gets rid
of all but a constant number of processes very quickly.

4
The log∗ function counts how many times you need to hit n with lg to reduce it to
one or less. So log∗ 1 = 0, log∗ 2 = 1, log∗ 4 = 2, log∗ 16 = 3, log∗ 65536 = 4, log∗ 265536 = 5,
and after that it starts getting silly.
Chapter 5

Concentration bounds

If we really want to get tight bounds on a random variable X, the trick will
turn out to be picking some non-negative function f (X) where (a) we can
calculate E [f (X)], and (b) f grows fast enough that merely large values of
X produce huge values of f (X), allowing us to get small probability bounds
by applying Markov’s inequality to f (X). This approach is often used to
show that X lies close to E [X] with reasonably high probability, what is
known as a concentration bound.
Typically concentration bounds are applied to sums of random variables,
which may or may not be fully independent. Which bound you may want to
use often depends on the structure of your sum. A quick summary of the
bounds in this chapter is given in Table 5.1. The rule of thumb is to use
Chernoff bounds (§5.2) if you have a sum of independent 0–1 random variables;
the Azuma-Hoeffding inequality (§5.3) if you have bounded variables with a
more complicated distribution that may be less independent; and Chebyshev’s
inequality (§5.1) if nothing else works but you can somehow compute the
variance of your sum (e.g., if the Xi are independent or have easily computed
covariance). In the case of Chernoff bounds, you will almost always end up
using one of the weaker but cleaner versions in §5.2.2 rather than the general
version in §5.2.1.
If none of these bounds work for your particular application, there are
many more out there. See for example the textbook by Dubhashi and
Panconesi [DP09].

59
CHAPTER 5. CONCENTRATION BOUNDS 60

 E[S]

Chernoff Xi ∈ {0, 1}, independent Pr [S ≥ (1 + δ) E [S]] ≤ (1+δ)1+δ

 
t2
Azuma-Hoeffding |Xi | ≤ ci , martingale Pr [S ≥ t] ≤ exp − 2 P c2i

Var[S]
Chebyshev Pr [|S − E [S]| ≥ α] ≤ α2

P
Table 5.1: Concentration bounds for S = Xi (strongest to weakest)

5.1 Chebyshev’s inequality


Chebyshev’s inequality allows us to show that a random variable is close
to its mean, byapplying Markov’s inequality to the variance of X, defined
2

as Var [X] = E (X − E [X]) . It’s a fairly weak concentration bound, that is
most useful when X is a sum of random variables with limited independence.
Using Markov’s inequality, calculate
h i
Pr [|X − E [X]| ≥ α] = Pr (X − E [X])2 ≥ α2
E (X − E [X])2
 

α2
Var [X]
= . (5.1.1)
α2

5.1.1 Computing variance


At this point it would be reasonable to ask why we are going through
Var [X] = E (X − E [X])2 rather than just using E [|X − E [X]|]. The
reason is that Var [X] is usually easier to compute, especially if X is a sum.
In this section, we’ll give some examples of computing variance, including
for various standard random variables that come up often in randomized
algorithms.

5.1.1.1 Alternative formula


The first step is to give an alternative formula for the variance that is more
convenient in some cases.
CHAPTER 5. CONCENTRATION BOUNDS 61

Expand
h i h i
E (X − E [X])2 = E X 2 − 2X · E [X] + (E [X])2
h i
= E X 2 − 2 E [X] · E [X] + (E [X])2
h i
= E X 2 − (E [X])2 . (5.1.2)

This formula is easier to use if you are estimating the variance from a
sequence of samples; by tracking x2i and xi , you can estimate E X 2
P P  

and E [X] in a single pass, without having to estimate E [X] first and then go
back for a second pass to calculate (xi − E [X])2 for each sample. We won’t
use this particular application much, but this explains why the formula is
popular with statisticians.

5.1.1.2 Variance of a Bernoulli random variable


Recall that a Bernoulli random variable is 1 with probability p and 0 with
probability q = 1 − p; in particular, any indicator variable is a Bernoulli
random variable.
The variance of a Bernoulli random variable is easily calculated from
(5.1.2):
h i
Var [X] = E X 2 − (E [X])2
= p − p2
= p(1 − p)
= pq.

5.1.1.3 Variance of a sum


P
If S = i Xi , then we can calculate
 !2  " #!2
X X
Var [S] = E  Xi − E Xi
i i
 
XX XX
= E Xi Xj  − E [Xi ] E [Xj ]
i j i j
XX
= (E [Xi Xj ] − E [Xi ] E [Xj ]) .
i j
CHAPTER 5. CONCENTRATION BOUNDS 62

For any two random variables X and Y , the quantity E [XY ]−E [X] E [Y ]
is called the covariance of X and Y , written Cov [X, Y ]. If we take the
covariance of a variable and itself, covariance becomes variance: Cov [X, X] =
Var [X].
We can use Cov [X, Y ] to rewrite the above expansion as
" #
X X
Var Xi = Cov [Xi , Xj ] (5.1.3)
i i,j
X X
= Var [Xi ] + Cov [Xi , Xj ] (5.1.4)
i i6=j
X X
= Var [Xi ] + 2 Cov [Xi , Xj ] (5.1.5)
i i<j

Note that Cov [X, Y ] = 0 when X and Y are independent; this makes
Chebyshev’s inequality particularly useful for pairwise-independent ran-
dom variables, because then we can just sum up the variances of the individual
variables.
P
A typical application is when we have a sum S = Xi of non-negative
random variables with small covariance; here applying Chebyshev’s inequality
to S can often be used to show that S is not likely to be much smaller than
E [S], which can be handy if we want to show that some lower bound holds
on S with some probability. This complements Markov’s inequality, which
can only be used to get upper bounds.
Pn
For example, suppose S = i=1 Xi , where the Xi are independent
Bernoulli random variables with E [Xi ] = p for all i. Then E [S] = np, and
P
Var [S] = i Var [Xi ] = npq (because the Xi are independent). Chebyshev’s
inequality then says
npq
Pr [|S − E [S]| ≥ α] ≤ .
α2
The highest variance is when p = 1/2. In this case, he probability that

S is more than β n away from its expected value n/2 is bounded by 4β1 2 .
We’ll see better bounds on this problem later, but this may already be good
enough for many purposes.
More generally, the approach of bounding S from below by estimating
 2
E [S] and either E S or Var [S] is known as the second-moment method.
In some cases, tighter bounds can be obtained by more careful analysis.
CHAPTER 5. CONCENTRATION BOUNDS 63

5.1.1.4 Variance of a geometric random variable


Let X be a geometric random variable with parameter p as defined in
§3.6.2, so that X takes on the values 1, 2, . . . and Pr [X = n] = q n−1 p, where
q = 1 − p as usual. What is Var [X]?
We know that E [X] = 1/p, so (E [X])2 = 1/p2 . Computing E X 2 is
 

trickier. Rather than do this directly from the definition of expectation, we


can exploit the memorylessness of geometric random variables to get it using
conditional expectations, just like we did for E [X] in §3.6.2.
Conditioning on the two possible outcomes of the first trial, we have
h i h i
E X2 = p + q E X2 X > 1 . (5.1.6)

We now argue that E X 2 X > 1 = E (X + 1)2 . The intuition is


   

that once we have flipped one coin the wrong way, we are back where
we started, except that now we have to add that extra coin to X. More
Pr[X 2 =n] n−1
formally, we have, for n > 1, Pr X 2 = n X > 1 = Pr[X>1] = q q p =
 

q n−2 p = Pr [X = n − 1] = Pr [X + 1 = n]. So we get the same probability


mass function for X conditioned on X > 1 as for X + 1 with no conditioning.
Applying this observation to the right-hand side of (5.1.6) gives
h i h i
E X 2 = p + q E (X + 1)2
 h i 
= p + q E X 2 + 2 E [X] + 1
2qh i
= p + q E X2 + +q
p
h i 2q
= 1 + q E X2 + .
p
A bit of algebra turns this into
h i 1 + 2q/p
E X2 =
1−q
1 + 2q/p
=
p
p + 2q
=
p2
2−p
= .
p2
CHAPTER 5. CONCENTRATION BOUNDS 64

Now subtract (E [X])2 = 1


p2
to get
1−p q
Var [X] = 2
= 2. (5.1.7)
p p
By itself, this doesn’t give very good bounds on X. For example, if we
want to bound the probability that X = 1, we get
1
 
Pr [X = 1] = Pr X − E [X] = 1 −
p
1
 
≤ Pr |X − E [X]| ≥ − 1
p
Var [X]
≤ 2
1
p − 1
q/p2
= 2
1
−1
p
q
=
(1 − p)2
1
= .
q
Since 1q ≥ 1, we could have gotten this bound with much less work.
The other direction is not much better. We can easily calculate that
Pr [X ≥ n] is exactly q n−1 (because this corresponds to flipping n coins
the wrong way, no matter what happens with subsequent coins). Using
Chebyshev’s inequality gives
1 1
 
Pr [X ≥ n] ≤ Pr X − ≥n−
p p
q/p 2
≤ 2
n − p1
q
= .
(np − 1)2
This at least has the advantage of dropping below 1 when n gets large
enough, but it’s only polynomial in n while the true value is exponential.
Where this might be useful is in analyzing the sum of a bunch of geometric
random variables, as occurs in the Coupon Collector problem discussed in
§3.6.3.1 Letting Xi be the number of balls to take us from i − 1 to i empty
1
We are following here a similar analysis in [MU17, §3.3.1].
CHAPTER 5. CONCENTRATION BOUNDS 65

bins, we have previously argued that Xi has a geometric distribution with


p = n−i−1
n , so
, 2
i−1 n−i−1
Var [Xi ] =
n n
i−1
=n ,
(n − i − 1)2

and
" n # n
X X
Var Xi = Var [Xi ]
i=1 i=1
n
X i−1
= n .
i=1 (n − i − 1)2

Having the numerator go up while the denominator goes down makes


this a rather unpleasant sum to try to solve directly. So we will follow the
lead of Mitzenmacher and Upfal and bound the numerator by n, giving
" n # n
X X n
Var Xi ≤ n
i=1 i=1 (n − i − 1)2
n
X 1
= n2
i=1
i2

X 1
≤ n2
i=1
i2
π2
= n2 .
6
2
The fact that ∞ 1 π
P
i=1 i2 converges to 6 is not trivial to prove, and was first
shown by Leonhard Euler in 1735 some ninety years after the question was
first proposed.2 But it’s easy to show that the series converges to something,
so even if we didn’t have Euler’s help, we’d know that the variance is O(n2 ).
Since the expected value of the sum is Θ(n log n), this tells us that we
are likely to see a total waiting time reasonably close to this; with at least
2
See http://en.wikipedia.org/wiki/Basel_Problem for a history, or Euler’s original
paper [Eul68], available at http://eulerarchive.maa.org/docs/originals/E352.pdf,
for the actual proof in the full glory of its original 18th-century typesetting. Curiously,
though Euler announced his result in 1735, he didn’t submit the journal version until 1749,
and it didn’t see print until 1768. Things moved more slowly in those days.
CHAPTER 5. CONCENTRATION BOUNDS 66

a constant probability, it will be within Θ(n) of the expectation. In fact,


the distribution is much more sharply concentrated (see [MU17, §5.4.1] or
[MR95, §3.6.3]), but this bound at least gives us something.

5.1.2 More examples


Here are some more examples of Chebyshev’s inequality in action. Some
of these repeat examples for which we previously got crummier bounds in
§4.1.1.

5.1.2.1 Flipping coins


Let X be the sum of n independent fair coins. Let Xi be the indicator
variable for the event that the i-th coin comes up heads. Then Var [Xi ] = 1/4
and Var [X] = Var [Xi ] = n/4. Chebyshev’s inequality gives Pr [X = n] ≤
P
n/4 1
Pr [|X − n/2| ≥ n/2] ≤ (n/2) 2 = n . This is still not very good, but it’s
getting better. It’s also about the best we can do given only the assumption
of pairwise independence.
To see this, let n = 2m − 1 for some m, and let Y1 . . . Ym be independent,
fair 0–1 random variables. For each non-empty subset S of {1 . . . m}, let
XS be the exclusive OR of all Yi for i ∈ S. Then (a) the XS are pairwise
independent; (b) each XS has variance 1/4; and thus (c) the same Chebyshev’s
P
inequality analysis for independent coin flips above applies to X = S XS ,
Var[S] n/4 1
giving Pr [|X − n/2| = n/2] ≤ (n/2) 2 = n2 /4 = n . In this case it is not
actually possible for X to equal n, but we can have X = 0 if all the Yi are 0,
which occurs with probability 2−m = n+1 1
. So the Chebyshev’s inequality is
almost tight in this case.

5.1.2.2 Balls in bins


Let Xi be the indicator that the i-th of m balls lands in a particular bin.
Then E [Xi ] = 1/n, giving E [ Xi ] = m/n, and Var [Xi ] = 1/n − 1/n2 ,
P

giving Var [ Xi ] = m/n − m/n2 . So the probability that we get k + m/n


P

or more balls in a particular bin is at most (m/n − m/n2 )/k 2 < m/nk 2 , and
applying the union bound, the probability that we get k + m/n or more balls
in any of the n bins is less than m/k 2 . Setting this equal to  andpsolving for
k gives a probability of at most  of getting more than m/n + m/ balls
in any of the bins. This is not as good a bound as we will be able to prove
later, but it’s at least non-trivial.
CHAPTER 5. CONCENTRATION BOUNDS 67

5.1.2.3 Lazy select


This example comes from [MR95, §3.3]; essentially the same example, spe-
cialized to finding the median, also appears in [MU17, §3.5].3
We want to find the k-th smallest element S(k) of a set S of size n. (The
parentheses around the index indicate that we are considering the sorted
version of the set S(1) < S(2) · · · < S(n) .) The idea is to:

1. Sample a multiset R of n3/4 elements of S with replacement and sort


them. This takes O(n3/4 log n3/4 ) = o(n) comparisons so far.

2. Use our sample to find an interval that is likely to contain S(k) . The
idea is to pick indices ` = (k − n3/4 )n−1/4 and r = (k + n3/4 )n−1/4 and
use R(`) and R(r) as endpoints (we are omitting some floors and maxes
here to simplify the notation; for a more rigorous presentation see
[MR95]). The hope is that the interval P = [R(`) , R(r) ] in S will both
contain S(k) , and be small, with |P | ≤ 4n3/4 + 2. We can compute the
elements of P in 2n comparisons exactly by comparing every element
with both R(`) and R(r) .

3. If both these conditions hold, sort P (o(n) comparisons) and return


S(k) . If not, try again.
We want to get a bound on how likely it is that P either misses S(k) or
is too big.
For any fixed k, the probability that one sample in R is less than or
equal to S(k) is exactly k/n, so the expected number X of samples ≤ S(k)
is exactly kn−1/4 . The variance on X can be computed by summing the
variances of the indicator variables that each sample is ≤ S(k) , which gives
a bound Var [X] =hn3/4 ((k/n)(1 − k/n)) 3/4
i ≤ n /4. Applying Chebyshev’s

inequality gives Pr X − kn−1/4 ≥ n ≤ n3/4 /4n = n−1/4 /4.
Now let’s look at the probability that P misses S(k) because R(`) is too

big, where ` = kn−1/4 − n. This is
h i h √ i
Pr R(`) > S(k) = Pr X < kn−1/4 − n
≤ n−1/4 /4.
3
The history is that Motwani and Raghavan adapted this algorithm from a similar
algorithm by Floyd and Rivest [FR75]. Mitzenmacher and Upfal give a version that also
includes the adaptations appearing Motwani and Raghavan, although they don’t say where
they got it from, and it may be that both textbook versions come from a common folklore
source.
CHAPTER 5. CONCENTRATION BOUNDS 68

(with the caveat that we are being sloppy about round-off errors).
Similarly,
h i h √ i
Pr R(h) < S(k) = Pr X > kn−1/4 + n
≤ n−1/4 /4.

So the total probability that P misses S(k) is at most n−1/4 /2.


Now we want to show that |P | is small. We will do so by showing that
it is likely that R(`) ≥ S(k−2n3/4 ) and R(h) ≤ S(k+2n3/4 ) . Let X` be the
number of samples in R that are ≤ S(k−2n3/4 ) and Xr be the number of

samples in R that are ≤ S(k+2n3/4 ) . Then we have E [X` ] = kn−1/4 − 2 n

and E [Xr ] = kn−1/4 + 2 n, and Var [Xl ] and Var [Xr ] are both bounded by
n3/4 /4.
We can now compute
h i √
Pr R(l) < S(k−2n3/4 ) = P r[Xl > kn−1/4 − n] < n−1/4 /4

by the same Chebyshev’s inequality argument


h as before,
i and get the sym-
metric bound on the other side for Pr R(r) > S(k+2n3/4 ) . This gives a total
bound of n−1/4 /2 that P is too big, for a bound of n−1/4 = o(n) that the
algorithm fails on its first attempt.
The total expected number of comparisons is thus given by T (n) =
2n + o(n) + O(n−1/4 T (n)) = 2n + o(n).

5.2 Chernoff bounds


To get really tight bounds, we apply Markov’s inequality to exp(αS), where
P
S = i Xi . This works best when the Xi are independent: if this is
the case, so are the variables exp(αXi ), and so we can easily calculate
Q Q
E [exp(αS)] = E [ i exp(αXi )] = i E [exp(αXi )].
The quantity E [exp(αS)], treated as a function of α, is called the
moment h generating
i function of S, because it expands formally into
P∞
E X k αk , the exponential generating function for the series of
k=0 k! h i
k-th moments E X k . Note that it may not converge for all S and α;4 we
will be careful to choose α for which it does converge and for which Markov’s
inequality gives us good bounds.
 
4
For example, the moment generating function for our earlier bad X with Pr X = 2k =
−k
P −k αk
2 is equal to k
2 e , which diverges unless eα /2 < 1.
CHAPTER 5. CONCENTRATION BOUNDS 69

5.2.1 The classic Chernoff bound


The basic Chernoff bound applies to sums of independent 0–1 random vari-
ables, which need not be identically distributed. For identically distributed
random variables, the sum has a binomial distribution, which we can either
compute exactly or bound more tightly using approximations specific to
binomial tails; for sums of bounded random variables that aren’t necessarily
0–1, we can use Hoeffding’s inequality instead (see §5.3).
Let each Xi for i = 1 . . . n be a 0–1 random variable with expectation
pi , so that E [S] = µ = i pi . The plan is to show Pr [S ≥ (1 +hδ)µ]iis small
P

when δ and µ are large, by applying Markov’s inequality to E eαS , where


α will be chosen to make the bound as tight has possible
i for some specific δ.
The first step is to get an upper bound on E eαS .
Compute
h i h P i
E eαS = E eα Xi

Y h i
= E eαXi
i
Y 
= pi eα + (1 − pi )e0
i
Y
= (pi eα + 1 − pi )
i
Y
= (1 + (eα − 1) pi )
i
Y α −1)p
≤ e(e i

i
P
(eα −1) i pi
=e
α −1)µ
= e(e .

The sneaky inequality step in the middle uses the fact that (1 + x) ≤ ex
for all x, which itself is one of the most useful inequalities you can memorize.5
What’s nice about this derivation is that at the end, the pi have vanished.
We don’t care what random variables we started with or how many of them
there were, but only about their expected sumhµ. i
Now that we have an upper bound on E eαS , we can throw it into
5
For a proof of this inequality, observe that the function f (x) = ex − (1 + x) has the
derivative ex − 1, which is positive for x > 0 and negative for x < 0. It follows that x = 1
is the unique minimum of f , at which f (1) = 0.
CHAPTER 5. CONCENTRATION BOUNDS 70

Markov’s inequality to get the bound we really want:


h i
Pr [S ≥ (1 + δ)µ] = Pr eαS ≥ eα(1+δ)µ
h i
E eαS

eα(1+δ)µ
α
e(e −1)µ
≤ α(1+δ)µ
e !µ
α
ee −1
=
eα(1+δ)
 α −1−α(1+δ)

= ee .

We now choose α to minimize the base in the last expression, by minimiz-


ing its exponent eα − 1 − α(1 + δ). Setting the derivative of this expression
with respect to α to zero gives eα = (1 + δ) or α = ln(1 + δ); luckily, this
value of α is indeed greater than 0 as we have been assuming. Plugging this
value in gives
 µ
Pr [S ≥ (1 + δ)µ] ≤ e(1+δ)−1−(1+δ) ln(1+δ)


= . (5.2.1)
(1 + δ)1+δ

The base of this rather atrocious quantity is e0 /11 = 1 at δ = 0, and its


derivative is negative for δ ≥ 0 (the easiest way to show this is to substitute
δ = x − 1 first). So the bound is never greater than 1 and is both decreasing
and less than 1 as soon as δ > 0. We also have that the bound is exponential
in µ for any fixed δ.
If we look at the shape of the base as a function of δ, we can observe that
when δ is very large, we can replace (1 + δ)1+δ with δ δ without changing the
bound much (and to the extent that we change it, it’s an increase, so it still
δ
works as a bound). This turns the base into δeδ = (e/δ)δ = 1/(δ/e)δ . This is

pretty close to Stirling’s formula for 1/δ! (there is a 2πδ factor missing).
For very small δ, we have that 1 + δ ≈ eδ , so the base becomes approxi-
eδ 2
mately eδ(1+δ) = e−δ . This approximation goes in the wrong direction (it’s
smaller than the actual value) but with some fudging we can show bounds
2
of the form e−µδ /c for various constants c, as long as δ is not too big.
CHAPTER 5. CONCENTRATION BOUNDS 71

5.2.2 Easier variants


The full Chernoff bound can be difficult to work with, especially since it’s
hard to invert (5.2.1) to find a good δ that gives a particular  bound.
Fortunately, there are approximate variants that substitute a weaker but less
intimidating bound. Some of the more useful are:
• For 0 ≤ δ ≤ 1.81,
2 /3
Pr [X ≥ (1 + δ)µ] ≤ e−µδ . (5.2.2)
(The actual upper limit is slightly higher.) Useful for small val-
ues of δ, especially because the bound can be inverted: ifpwe want
Pr [X ≥ (1 + δ)µ] ≤ exp(−µδ 2 /3) ≤ , we can use any δ with 3 ln(1/)/µ ≤
δ ≤ 1.81.
The proof of the approximate bound is to show that, in the given
range, eδ /(1 + δ)1+δ ≤ exp(−δ 2 /3). This is easiest to do numerically;
a somewhat more formal argument that the bound holds in the range
0 ≤ δ ≤ 1 can be found in [MU17, Theorem 4.4].
• For 0 ≤ δ ≤ 4.11,
2 /4
Pr [X ≥ (1 + δ)µ] ≤ e−µδ . (5.2.3)
This is a slightly weaker bound than the previous
p that holds over a
larger range. It gives Pr [X ≥ (1 + δ)µ] ≤  if 4 ln(1/)/µ ≤ δ ≤ 4.11.
Note that the version given on page 72 of [MR95] is not correct; it
claims that the bound holds up to δ = 2e − 1 ≈ 4.44, but it fails
somewhat short of this value.
• For R ≥ 2eµ,
Pr [X ≥ R] ≤ 2−R . (5.2.4)
Sometimes the assumption is replaced with the stronger R ≥ 6µ (this
is the version given in [MU17, Theorem 4.4], for example); one can also
verify numerically that R ≥ 5µ (i.e., δ ≥ 4) is enough. The proof of the
2eµ bound is that eδ /(1+δ)(1+δ) < e1+δ /(1+δ)(1+δ) = (e/(1+δ))1+δ ≤
2−(1+δ) when e/(1 + δ) ≤ 1/2 or δ ≥ 2e − 1. Raising this to µ gives
Pr [X ≥ (1 + δ)µ] ≤ 2−(1+δ)µ for δ ≥ 2e − 1. Now substitute R for
(1 + δ)µ (giving R ≥ 2eµ) to get the full result. Inverting this one gives
Pr [X ≥ R] ≤  when R ≥ min(2eµ, lg(1/)).
Figure 5.1 shows the relation between the various bounds, in the region
where they cross each other.
CHAPTER 5. CONCENTRATION BOUNDS 72

2−R/µ
δ ≤ 4.11+


(1+δ)(1+δ)
δ ≤ 1.81+ R/µ ≥ 4.32−

2 /4
e−δ
2 /3 e−δ

Figure 5.1: Comparison of Chernoff bound variants with exponent µ omitted,


plotted in logarithmic scale relative to the standard bound. Each bound is
valid only in in the region where it exceeds eδ /(1 + δ)1+δ .

5.2.3 Lower bound version


We can also use Chernoff bounds to show that a sum of independent 0–1
random variables isn’t too small. The essential idea is to repeat the up-
per bound argument with a negative value of α, which makes eα(1−δ)µ an
increasing function in δ. The resulting bound is:

e−δ
Pr [S ≤ (1 − δ)µ] ≤ . (5.2.5)
(1 − δ)1−δ

A simpler but weaker version of this bound is


2 /2
Pr [S ≤ (1 − δ)µ] ≤ e−µδ . (5.2.6)

Both bounds hold for all δ with 0 ≤ δ ≤ 1.

5.2.4 Two-sided version


If we combine (5.2.2) with (5.2.6), we get
2 /3
Pr [|S − µ| ≥ δµ] ≤ 2e−µδ , (5.2.7)

for 0 ≤ δ ≤ 1.81.
Suppose that weqwant this bound to be less than . Then we need
2
2e /3 ≤  or δ ≥ 3 ln(2/)
−δ
µ . Setting δ to exactly this quantity, (5.2.7)
CHAPTER 5. CONCENTRATION BOUNDS 73

becomes  q 
Pr |S − µ| ≥ 3µ ln(2/) ≤ , (5.2.8)

provided  ≥ 2e−µ/3 .
For asymptotic purposes, we can omit the constants, giving

Lemma 5.2.1. Let S be a sum of independent 0–1  E [S] = µ.


variables with
p
Then for any 0 <  ≤ 2e −µ/3 , S lies within O µ log(1/) of µ, with
probability at least 1 − .

5.2.5 What if we only have a bound on E [S]?


P
For some applications, we may not know E [S] = E [Xi ] exactly, but have
only an upper bound E [S] ≤ µ. It is not immediately obvious that we can
use this upper bound in place of the actual value of E [S] when computing
an upper tail bound on S, because µ appears in the exponent of all of
our bounds, and it is not clear that the corresponding reduction in δ fully
compensates for this. However, there is a simple argument that shows that
all of these bounds hold even if we overestimate E [S].
Consider a sum S of independent 0-1 random variables with E [S] =
0
µ ≤ µ. We can turn this into a sum of independent 0-1 random variables
with expectation exactly µ by adding enough new extra variables to make a
second sum T with E [T ] = µ − µ0 . Now S + T satisfies the conditions for
(5.2.1), and S ≤ S + T always, so

Pr [S ≥ (1 + δ)µ] ≤ Pr [S + T ≥ (1 + δ)µ]


<= .
(1 + δ)1+δ

This also works for any of the bounds in §5.2.2 that are derived from
(5.2.1).
In the other direction, if we know E [S] ≥ µ and want to apply the lower
tail bound (5.2.5), we can apply a slightly different construction. Suppose
that E [S] = µ0 ≥ µ. For each Xi , construct a 0-1 random variable Yi such
that (a) all the Yi are independent of each other, (b) Yi ≤ Xi always, and (c)
E [Yi ] = E [Xi ] (µ/µ0 ). The easiest way to do this is to set Yi = Xi Zi where
each Zi is an independent biased coin with expectation µ/µ0 .
Let T = Yi . Then T ≤ S and E [T ] = E [Yi ] = E [Xi ] (µ/µ0 ) = µ.
P P P
CHAPTER 5. CONCENTRATION BOUNDS 74

Since T satisfies the requirements of (5.2.5), we can argue

Pr [S ≤ (1 − δ)µ] ≤ Pr [T ≤ (1 − δ)µ]

e−δ
≤ .
(1 − δ)1−δ

As with the upper tail bound, this approach also works for simpified
versions of the lower tail bound like (5.2.6).
For the two-sided variants, we are out of luck. The best we can do if we
know a ≤ E [S] ≤ b is to apply each of the one-sided bounds separately.

5.2.6 Almost-independent variables


Chernoff bounds generally don’t work very well for variables that are not
independent, and in most such cases we must use Chebyshev’s inequality
(§5.1) or the Azuma-Hoeffding inequality (§5.3) instead. But there is one
special case that comes up occasionally where it helps to be able to apply
the Chernoff bound to variables that are almost independent in a particular
technical sense.

Lemma 5.2.2. Let X1 , . . . , Xn be 0–1 random variables with the property


that E [Xi | X1 , . . . , Xi−1 ] ≤ pi ≤ 1 for all i. Let µ = ni=1 pi and S =
P
Pn
i=1 Xi . Then (5.2.1) holds.
Alternatively, let E [Xi | X1 , . . . , Xi−1 ] ≥ pi ≥ 0 for all i, and let µ =
Pn Pn
i=1 pi and S = i=1 Xi as before. Then (5.2.5) holds.

Proof. Rather than repeat the argument for independent variables, we will
employ a coupling, where we replace the Xi with independent Yi so that
Pn Pn
i=1 Yi gives a bound an i=1 Xi .
For the upper bound, let each Yi = 1 with independent probability
pi . Use the following process to generate a new Xi0 in increasing order
of i: if Yi = 0, set Xi0 = 0. Otherwise set Xi0 = 1 with probability
Pr Xi = 1 X1 = X10 , . . . Xi−1
0 /pi , Then Xi0 ≤ Yi , and

Pr Xi0 = 1 X10 , . . . , Xi0 = Pr Xi = 1 X1 = X10 , . . . , Xi−1 = Xi−1


0
    
/pi Pr [Yi = 1]
= Pr Xi = 1 X1 = X10 , . . . , Xi−1 = Xi−1
0
 
.
CHAPTER 5. CONCENTRATION BOUNDS 75

It follows that the X′i have the same joint distribution as the Xi, and so

Pr[Σ_{i=1}^n X′i ≥ µ(1 + δ)] = Pr[Σ_{i=1}^n Xi ≥ µ(1 + δ)]
≤ Pr[Σ_{i=1}^n Yi ≥ µ(1 + δ)]
≤ (e^δ/(1 + δ)^{1+δ})^µ.

For the other direction, generate the Xi first and generate the Yi using
the same rejection sampling trick. Now the Yi are independent (because
their joint distribution is that of independent variables) and each Yi is a lower bound on the corresponding
Xi.

The lemma is stated for the general Chernoff bounds (5.2.1) and (5.2.5),
but the easier versions follow from these, so they hold as well, as long as we
are careful to remember that the µ in the upper bound is not necessarily the
same µ as in the lower bound.

5.2.7 Other tail bounds for the binomial distribution


The random graph literature can be a good source for bounds on the binomial
distribution. See for example [Bol01, §1.3], which uses normal approximation
to get bounds that are slightly tighter than Chernoff bounds in some cases,
and [JLR00, Chapter 2], which describes several variants of Chernoff bounds
as well as tools for dealing with sums of random variables that aren’t fully
independent.

5.2.8 Applications
5.2.8.1 Flipping coins
Suppose S is the sum of n independent fair coin-flips. Then E[S] = n/2 and
Pr[S = n] = Pr[S ≥ 2 E[S]] is bounded using (5.2.1) by setting µ = n/2,
δ = 1 to get Pr[S = n] ≤ (e/4)^{n/2} = (2/√e)^{−n}. This is not quite as good as
the real answer 2^{−n} (the quantity 2/√e is about 1.213. . . ), but it's at least
exponentially small.
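As a quick numerical sanity check (a throwaway sketch, not from the original text), we can compare the Chernoff bound with the exact probability:

import math

for n in (10, 100, 1000):
    chernoff = (math.e / 4) ** (n / 2)   # (e/4)^(n/2) = (2/sqrt(e))^(-n)
    exact = 2.0 ** (-n)                  # true probability of all heads
    print(n, chernoff, exact)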

5.2.8.2 Balls in bins again


Let's try applying the Chernoff bound to the balls-in-bins problem. Here
we let S = Σ_{i=1}^m Xi be the number of balls in a particular bin, with Xi
the indicator that the i-th ball lands in the bin, E[Xi] = pi = 1/n, and
E[S] = µ = m/n. To get a bound on Pr[S ≥ m/n + k], apply the Chernoff
bound with δ = kn/m to get

Pr[S ≥ m/n + k] = Pr[S ≥ (m/n)(1 + kn/m)]
≤ (e^{kn/m}/(1 + kn/m)^{1+kn/m})^{m/n}.

For m = n, this collapses to the somewhat nicer but still pretty horrifying
e^k/(k + 1)^{k+1}.
Staying with m = n, if we are bounding the probability of having large
bins, we can use the 2−R variant to show that the probability that any
particular bin has more than 2 lg n balls (for example), is at most n−2 , giving
the probability that there exists such a bin of at most 1/n. This is not as
strong as what we can get out of the full Chernoff bound. If we take the
logarithm of e^k/(k + 1)^{k+1}, we get k − (k + 1) ln(k + 1); if we then substitute
k = (c ln n / ln ln n) − 1, we get

(c ln n / ln ln n) − 1 − (c ln n / ln ln n) ln(c ln n / ln ln n)
= (ln n) [c/ln ln n − 1/ln n − (c/ln ln n)(ln c + ln ln n − ln ln ln n)]
= (ln n) [c/ln ln n − 1/ln n − (c ln c)/ln ln n − c + (c ln ln ln n)/ln ln n]
= (ln n)(−c + o(1)).

So the probability of getting more than c ln n/ ln ln n balls in any one bin
is bounded by exp((ln n)(−c + o(1))) = n^{−c+o(1)}. This gives a maximum bin
size of O(log n/ log log n) with any fixed probability bound n^{−a} for sufficiently
large n.
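The following sketch (illustrative only; not part of the analysis) throws n balls into n bins a few times and compares the largest load observed with ln n / ln ln n:

import math
import random
from collections import Counter

def max_load(n):
    """Largest number of balls in any bin after throwing n balls into n bins."""
    return max(Counter(random.randrange(n) for _ in range(n)).values())

n = 1_000_000
print(max(max_load(n) for _ in range(5)), math.log(n) / math.log(math.log(n)))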

5.2.8.3 Flipping coins, central behavior


Suppose we flip n fair coins, and let S be the number that come up heads.
We expect µ = n/2 heads on average. How many extra heads can we get, if
we want to stay within a probability bound of n−c ?
Here we use the small-δ approximation, which gives Pr[S ≥ (1 + δ)(n/2)] ≤
exp(−δ²n/6). Setting exp(−δ²n/6) = n^{−c} gives δ = √(6 ln n^c / n) = √(6c ln n / n).

[Figure 5.2: Hypercube network with n = 3]

The actual excess over the mean is δ(n/2) = (n/2)√(6c ln n/n) = √((3/2)cn ln n).
By symmetry, the same bound applies to extra tails. So if we flip 1000
coins and see more than 676 heads (roughly the bound when c = 3), we can
reasonably conclude that either (a) our coin is biased, or (b) we just hit a
rare one-in-a-billion jackpot.
In algorithm analysis, the √((3/2)c) part usually gets absorbed into the
asymptotic notation, and we just say that with probability at least 1 − n^{−c},
the sum of n random bits is within O(√(n log n)) of n/2.
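The 676 figure is easy to reproduce; this short sketch evaluates n/2 + √((3/2)cn ln n) for n = 1000 and c = 3:

import math

n, c = 1000, 3
print(n / 2 + math.sqrt(1.5 * c * n * math.log(n)))   # roughly 676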

5.2.8.4 Permutation routing on a hypercube


Here we use Chernoff bounds to show bounds on a classic permutation-
routing algorithm for hypercube networks due to Valiant [Val82]. The
presentation here is based on §4.2 of [MR95], which in turn is based on an
improved version of Valiant’s original analysis that appeared in a follow-up
paper with Brebner [VB81]. There’s also a write-up of this in [MU17, §4.6.1].
The basic idea of a hypercube architecture is that we have a collection of
N = 2n processors, each with an n-bit address. Two nodes are adjacent if
their addresses differ by one bit (see Figure 5.2 for an example). Though
now mostly of theoretical interest, these things were the cat’s pajamas back
in the 1980s: see http://en.wikipedia.org/wiki/Connection_Machine.
Suppose that at some point in a computation, each processor i wants to
send a packet of data to some processor π(i), where π is a permutation of
the addresses. But we can only send one packet per time unit along each of

the n edges leaving a processor.6 How do we route the packets so that all of
them arrive in the minimum amount of time?
We could try to be smart about this, or we could use randomization.
Valiant's idea is to first route each processor i's packet to some random
intermediate destination σ(i), then in the second phase, we route it from σ(i)
to its ultimate destination π(i). Unlike π, σ is not necessarily a permutation;
instead, σ(i) is chosen uniformly at random independently of all the other
σ(j). This makes the choice of paths for different packets independent of
each other, which we will need later to apply Chernoff bounds.
Routing is done by bit-fixing: if a packet is currently at node x and
heading for node y, find the leftmost bit j where xj ≠ yj and fix it, by
sending the packet on to x[xj/yj]. In the absence of contention, bit-fixing
routes a packet to its destination in at most n steps. The hope is that the
randomization will tend to spread the packets evenly across the network,
reducing the contention for edges enough that the actual time will not be
much more than this.
The first step is to argue that, during the first phase, any particular
packet is delayed at most one time unit by any other packet whose path
overlaps with it. Suppose packet i is delayed by contention on some edge uv.
Then there must be some other packet j that crosses uv during this phase.
From this point on, j remains one step ahead of i (until its path diverges),
so it can’t block i again unless both are blocked by some third packet k (in
which case we charge i’s further delay to k). This means that we can bound
the delays for packet i by counting how many other packets cross its path.7
So now we just need a high-probability bound on the number of packets that
get in a particular packet’s way.
Following the presentation in [MR95], define Hij to be the indicator
variable for the event that packets i and j cross paths during the first phase.
Because each j chooses its destination independently, once we fix i's path,
the Hij are all independent. So we can bound S = Σ_{j≠i} Hij using Chernoff
bounds. To do so, we must first calculate an upper bound on µ = E[S].
The trick here is to observe that any path that crosses i's path must cross
one of its edges, and we can bound the number of such paths by bounding
how many paths cross each edge. For each edge e, let Te be the number
of paths that cross edge e, and for each j, let Xj be the number of edges
that path j crosses. Counting two ways, we have Σe Te = Σj Xj, and so
6
Formally, we have a synchronous routing model with unbounded buffers at each node,
with a maximum capacity of one packet per edge per round.
7
A much more formal version of this argument is given as [MR95, Lemma 4.5].
E[Σe Te] = E[Σj Xj] ≤ N(n/2). By symmetry, all the Te have the same
expectation, so we get E[Te] ≤ N(n/2)/(Nn) = 1/2.
Now fix σ(i). This determines some path e1 e2 . . . ek for packet i. In
general we do not expect E[Te_ℓ | σ(i)] to equal E[Te_ℓ], because conditioning
on i's path crossing e_ℓ guarantees that at least one path crosses this edge that
might not have otherwise. However, if we let T′e be the number of packets j ≠ i that
cross e, then we have T′e ≤ Te always, giving E[T′e] ≤ E[Te], and because T′e
does not depend on i's path, E[T′e | σ(i)] = E[T′e] ≤ E[Te] ≤ 1/2. Summing
this bound over all k ≤ n edges on i's path gives E[Σ_{j≠i} Hij | σ(i)] ≤ n/2,
which implies E[Σ_{j≠i} Hij] ≤ n/2 after removing the conditioning on σ(i).
Inequality (5.2.4) says that Pr[X ≥ R] ≤ 2^{−R} when R ≥ 2eµ. Letting
X = Σ_{j≠i} Hij and setting R = 3n gives R = 6(n/2) ≥ 6µ > 2eµ, so
Pr[Σ_{j≠i} Hij ≥ 3n] ≤ 2^{−3n} = N^{−3}. This says that any one packet reaches its
random destination with at most 3n added delay (thus, in at most 4n time
units) with probability at least 1 − N^{−3}. If we consider all N packets, the
total probability that any of them fail to reach their random destinations in
4n time units is at most N · N^{−3} = N^{−2}. Note that because we are using
the union bound, we don't need independence for this step (which is good,
because we don't have it).
What about the second phase? Here, routing the packets from the random
destinations back to the real destinations is just the reverse of routing them
from the real destinations to the random destinations. So the same bound
applies, and with probability at most N −2 some packet takes more than
4n time units to get back (this assumes that we hold all the packets before
sending them back out, so there are no collisions between packets from
different phases).
Adding up the failure probabilities and costs for both stages gives a
probability of at most 2/N² that any packet takes more than 8n time units
to reach its destination.
The structure of this argument is pretty typical for applications of Cher-
noff bounds: we get a very small bound on the probability that something
bad happens by applying Chernoff bounds to a part of the problem where
we have independence, then use the union bound to extend this to the full
problem where we don’t.
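To see the first-phase analysis in action, here is a Python sketch (all names are invented for illustration) that builds bit-fixing paths and empirically estimates, for packet 0, the number of other packets whose Phase 1 paths share an edge with its own; by the argument above the mean is at most n/2, and exceeding 3n should essentially never happen.

import random

def bit_fixing_path(x, y, n):
    """Directed edges on the bit-fixing route from x to y in the n-dimensional hypercube."""
    edges, u = [], x
    for j in reversed(range(n)):            # fix bits from the leftmost (most significant) down
        if (u ^ y) >> j & 1:
            v = u ^ (1 << j)
            edges.append((u, v))
            u = v
    return edges

def crossings(n, trials=100):
    """Average number of packets j != 0 whose path shares an edge with packet 0's path."""
    N, total = 1 << n, 0
    for _ in range(trials):
        sigma = [random.randrange(N) for _ in range(N)]   # random intermediate destinations
        mine = set(bit_fixing_path(0, sigma[0], n))
        total += sum(1 for j in range(1, N)
                     if any(e in mine for e in bit_fixing_path(j, sigma[j], n)))
    return total / trials

n = 8
print(crossings(n), n / 2)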

5.3 The Azuma-Hoeffding inequality


The problem with Chernoff bounds is that they only work for sums of
Bernoulli random variables. Hoeffding’s inequality [Hoe63] is another
concentration bound based on the moment generating function that applies
to any sum of bounded independent random variables with mean 0.8 It has
the additional useful feature that it generalizes nicely to some collections of
random variables that are not mutually independent, as we will see in §5.3.2.
This more general version is known as Azuma’s inequality [Azu67] or the
Azuma-Hoeffding inequality.9

5.3.1 Hoeffding’s inequality


This is the version for sums of bounded independent random variables. We
will consider the symmetric case, where each variable Xi satisfies |Xi | ≤ ci
for some constant ci . Hoeffding’s original result considered bounds of the
form ai ≤ Xi ≤ bi , and is equivalent when ai = −bi .
The main tool is Hoeffding’s lemma, which states
Lemma 5.3.1. Let E[X] = 0 and |X| ≤ c with probability 1. Then

E[e^{αX}] ≤ e^{(αc)²/2}. (5.3.1)

Proof. The basic idea is that, for any α, eαx is a convex function. Since we
want an upper bound, we can’t use Jensen’s inequality (4.3.1), but we can
use the fact that X is bounded and we know its expectation. Convexity of
e^{αx} means that, for any x with −c ≤ x ≤ c, e^{αx} ≤ λe^{−αc} + (1 − λ)e^{αc}, where
x = λ(−c) + (1 − λ)c. Solving for λ in terms of x gives λ = (1/2)(1 − x/c) and
1 − λ = (1/2)(1 + x/c). So

E[e^{αX}] ≤ E[(1/2)(1 − X/c)e^{−αc} + (1/2)(1 + X/c)e^{αc}]
= (e^{−αc} + e^{αc})/2 − E[X] e^{−αc}/(2c) + E[X] e^{αc}/(2c)
= (e^{−αc} + e^{αc})/2
= cosh(αc).
8
Note that the requirement that E [Xi ] = 0 can always be satisfied by considering
instead Yi = Xi − E [Xi ].
9
The history of this is that Hoeffding [Hoe63] proved it for independent random variables,
and observed that the proof was easily extended to martingales, while Azuma [Azu67]
actually went and did the work of proving it for martingales.

In other words, the worst possible X is a fair choice between ±c, and
in this case we get the hyperbolic cosine of αc as its moment generating
function.
We don’t like hyperbolic cosines much, because we are going to want
to take products of our bounds, and hyperbolic cosines don’t multiply very
nicely. As before with 1 + x, we’d be much happier if we could replace the
cosh with a nice exponential. The Taylor series expansion of cosh x starts
with 1 + x²/2 + . . . , suggesting that we should approximate it with exp(x²/2),
and indeed it is the case that for all x, cosh x ≤ e^{x²/2}. This can be shown by
comparing the rest of the Taylor series expansions:

cosh x = (e^x + e^{−x})/2
= (1/2)(Σ_{n=0}^∞ x^n/n! + Σ_{n=0}^∞ (−x)^n/n!)
= Σ_{n=0}^∞ x^{2n}/(2n)!
≤ Σ_{n=0}^∞ x^{2n}/(2^n n!)
= Σ_{n=0}^∞ (x²/2)^n/n!
= e^{x²/2}.

This gives the claimed bound

E[e^{αX}] ≤ cosh(αc) ≤ e^{(αc)²/2}.
Theorem 5.3.2. Let X1 . . . Xn be independent random variables with E [Xi ] =
0 and |Xi | ≤ ci for all i. Then for all t,
" n # !
X t2
Pr Xi ≥ t ≤ exp − Pn 2 . (5.3.2)
i=1
2 i=1 ci

Proof. Let S = Σ_{i=1}^n Xi. As with Chernoff bounds, we'll first calculate a
bound on the moment generating function E[e^{αS}] and then apply Markov's
inequality with a carefully-chosen α.
From (5.3.1), we have E[e^{αXi}] ≤ e^{(αci)²/2} for all i. Using this bound and
the independence of the Xi, we compute

E[e^{αS}] = E[exp(α Σ_{i=1}^n Xi)]
= E[Π_{i=1}^n e^{αXi}]
= Π_{i=1}^n E[e^{αXi}]
≤ Π_{i=1}^n e^{(αci)²/2}
= exp(Σ_{i=1}^n α²ci²/2)
= exp((α²/2) Σ_{i=1}^n ci²).

Applying Markov's inequality then gives (when α > 0):

Pr[S ≥ t] = Pr[e^{αS} ≥ e^{αt}]
≤ exp((α²/2) Σ_{i=1}^n ci² − αt). (5.3.3)

Now we do the same trick as in Chernoff bounds and choose α to minimize
the bound. If we write C for Σ_{i=1}^n ci², this is done by minimizing the exponent
(α²/2)C − αt, which we do by taking the derivative with respect to α and setting
it to zero: αC − t = 0, or α = t/C. At this point, the exponent becomes
((t/C)²/2)C − (t/C)t = −t²/(2C).
Plugging this into (5.3.3) gives the bound (5.3.2) claimed in the theorem.
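As a sanity check on Theorem 5.3.2, this sketch (hypothetical names) compares the bound with an empirical tail estimate for a sum of ±1 variables:

import math
import random

def hoeffding_bound(t, cs):
    """Right-hand side of (5.3.2)."""
    return math.exp(-t * t / (2 * sum(c * c for c in cs)))

n, t, trials = 1000, 100, 5000
hits = sum(sum(random.choice((-1, 1)) for _ in range(n)) >= t for _ in range(trials))
print(hits / trials, hoeffding_bound(t, [1] * n))   # empirical tail vs. exp(-t^2/(2n))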

5.3.1.1 Hoeffding vs Chernoff


Let's see how good a bound this gets us for our usual test problem of bounding
Pr[S = n] where S = Σ_{i=1}^n Xi is the sum of n independent fair coin-flips. To
make the problem fit the theorem, we replace each Xi by a rescaled version
Yi = 2Xi − 1 = ±1 with equal probability; this makes E[Yi] = 0 as needed,
with |Yi| ≤ ci = 1. Hoeffding's inequality (5.3.2) then gives

Pr[Σ_{i=1}^n Yi ≥ n] ≤ exp(−n²/(2n)) = e^{−n/2} = (√e)^{−n}.

Since √e ≈ 1.649 . . . , this is actually slightly better than the (2/√e)^{−n}
bound we get using Chernoff bounds.
On the other hand, Chernoff bounds work better if we have a more
skewed distribution on the Xi; for example, in the balls-in-bins case, each
Xi is a Bernoulli random variable with E[Xi] = 1/n. Using Hoeffding's
inequality, we get a bound ci on |Xi − E[Xi]| of only 1 − 1/n, which puts
Σ_{i=1}^n ci² very close to n, requiring t = Ω(√n) before we get any non-trivial
bound out of (5.3.2), pretty much the same as in the fair-coin case (which is
not surprising, since Hoeffding's inequality doesn't know anything about the
distribution of the Xi). But we've already seen that Chernoff gives us that
Σ Xi = O(log n/ log log n) with high probability in this case.

5.3.1.2 Asymmetric version


The original version of Hoeffding’s inequality [Hoe63] assumes ai ≤ Xi ≤ bi ,
but E [Xi ] is still zero for all Xi . In this version, the bound is
" n # !
X 2t2
Pr Xi ≥ t ≤ exp − Pn 2
. (5.3.4)
i=1 i=1 (bi − ai )

This reduces to (5.3.2) when ai = −ci and bi = ci . The proof is essentially


the
h same,
i but a little more analytic sneakery is required to show that
2 2
E eαXi ≤ eα (bi −ai ) /8 ; see [McD89] for a proof of this that is a little more
approachable than Hoeffding’s original paper. For most applications, the
only difference between the symmetric version (5.3.2) and the asymmetric
version (5.3.4) is a small constant factor on the resulting bound on t.
Hoeffding’s inequality is not the tightest possible inequality that can be
obtained from the conditions under which it applies, but is relatively simple
and easy to work with. For a particularly strong version of Hoeffding’s
inequality and a discussion of some variants, see [FGQ12].

5.3.2 Azuma’s inequality


A general rule of thumb is that most things that work for sums of independent
random variables also work for martingales, which are sequences of random
variables that have similar behavior but allow for more dependence.

Formally, a martingale is a sequence of random variables S0 , S1 , S2 , . . . ,


where E [St | S1 , . . . , St−1 ] = St−1 . In other words, given everything you
know up until time t − 1, your best guess of the expected value at time t is
just wherever you are now.
Another way to describe a martingale is to take the partial sums St =
Σ_{i=1}^t Xi of a martingale difference sequence, which is a sequence of
random variables X1, X2, . . . where E[Xt | X1 . . . Xt−1] = 0. So in this
version, your expected change from time t − 1 to t averages out to zero, even
if you try to predict it using all the information you have at time t − 1.
In some cases it makes sense to allow extra information to sneak in. We
can represent this using σ-algebras, in particular by using a filtration of the
form F0 ⊆ F1 ⊆ F2 ⊆ . . . , where each Ft is a σ-algebra (see §3.5.3). A
sequence S0 , S1 , S2 , . . . is adapted to a filtration F0 ⊆ F1 ⊆ F2 ⊆ . . . if each
St is Ft -measurable. This means that at time t the sum of our knowledge
(Ft ) is enough to predict exactly the value of St . The subset relations also
mean that we remember everything there is to know about St0 for t0 < t.
The general definition of a martingale is a collection {(St , Ft ) | t ∈ N}
where

1. Each St is Ft -measurable; and

2. E [St+1 | Ft ] = St .

This means that even if we include any extra information we might have
at time t, we still can’t predict St+1 any better than by guessing the current
value St . This alternative definition will be important in some special cases,
as when St is a function of some other collection of random variables that
we use to define the Ft . Because Ft includes at least as much information as
S0 , . . . , St , it will always be the case that any sequence {(St , Ft )} that is a
martingale in the general sense gives a sequence {St } that is a martingale in
the more specialized E [St+1 | S0 , . . . , St ] = St sense.
Martingales were invented to analyze fair gambling games, where your
return over some time interval is not independent of previous outcomes (for
example, you may change your bet or what game you are playing depending
on how things have been going for you), but it is always zero on average
given previous information.10 The nice thing about martingales is they allow
10
Real casinos give negative expected return, so your winnings in a real casino form
a supermartingale with St ≥ E [St+1 | S0 . . . St ]. On the other hand, the casino’s take,
in a well-run casino, is a submartingale, a process with St ≤ E [St+1 | S0 . . . St ]. These
definitions also generalize in the obvious way to the {(St , Ft )} case.

for a bit of dependence while still acting very much like sums of independent
random variables.
Where this comes up with Hoeffding’s inequality is that we might have a
process that is reasonably well-behaved, but its increments are not technically
independent. For example, suppose that a gambler plays a game where
they bet x units 0 ≤ x ≤ 1 at each round, and receives ±x with equal
probability. Suppose also that their bet at each round may depend on the
outcome of previous rounds (for example, they might stop betting entirely
if they lose too much money). If Xi is their take at round i, we have that
E [Xi | X1 . . . Xi−1 ] = 0 and that |Xi | ≤ 1. This is enough to apply the
martingale version of Hoeffding’s inequality, often called Azuma’s inequality.
Theorem 5.3.3. Let {Sk} be a martingale with Sk = Σ_{i=1}^k Xi and |Xi| ≤ ci
for all i. Then for all n and all t ≥ 0:

Pr[Sn ≥ t] ≤ exp(−t²/(2 Σ_{i=1}^n ci²)). (5.3.5)

Proof. Basically, we just show that E[e^{αSn}] ≤ exp((α²/2) Σ_{i=1}^n ci²), just like
in the proof of Theorem 5.3.2, and the rest follows using the same argument.
The only tricky part is we can no longer use independence to transform
E[Π_{i=1}^n e^{αXi}] into Π_{i=1}^n E[e^{αXi}]. Instead, we use the martingale property.
For each Xi, we have E[Xi | X1 . . . Xi−1] = 0 and |Xi| ≤ ci always. Recall that E[e^{αXi} | X1 . . . Xi−1] is a random variable
that takes on the average value of e^{αXi} for each setting of X1 . . . Xi−1.
We can apply the same analysis as in the proof of 5.3.2 to show that this
means that E[e^{αXi} | X1 . . . Xi−1] ≤ e^{(αci)²/2} always.
The trick is to use the fact that, for any random variables X and Y,
E[XY] = E[E[XY | X]] = E[X E[Y | X]].
We argue by induction on n that E[Π_{i=1}^n e^{αXi}] ≤ Π_{i=1}^n e^{(αci)²/2}.
Q

The base case is when n = 0. For the induction step, compute

E[Π_{i=1}^n e^{αXi}] = E[E[Π_{i=1}^n e^{αXi} | X1 . . . Xn−1]]
= E[(Π_{i=1}^{n−1} e^{αXi}) E[e^{αXn} | X1 . . . Xn−1]]
≤ E[(Π_{i=1}^{n−1} e^{αXi}) e^{(αcn)²/2}]
= E[Π_{i=1}^{n−1} e^{αXi}] e^{(αcn)²/2}
≤ (Π_{i=1}^{n−1} e^{(αci)²/2}) e^{(αcn)²/2}
= Π_{i=1}^n e^{(αci)²/2}
= exp((α²/2) Σ_{i=1}^n ci²).

The rest of the proof goes through as before.
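To illustrate the theorem, here is a sketch of the history-dependent gambler described before the theorem statement (the specific betting rule is made up): the increments are not independent, but they form a bounded martingale difference sequence, so the final wealth exceeds t with probability at most exp(−t²/2n).

import math
import random

def gamble(n):
    """n fair rounds; the bet size depends on past outcomes, so increments are dependent."""
    wealth = 0.0
    for _ in range(n):
        bet = min(1.0, 1.0 / (1.0 + abs(wealth)))          # hypothetical strategy, 0 <= bet <= 1
        wealth += bet if random.random() < 0.5 else -bet   # fair coin, zero conditional mean
    return wealth

n, t, trials = 10_000, 300, 500
print(sum(gamble(n) >= t for _ in range(trials)) / trials, math.exp(-t * t / (2 * n)))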

Some extensions:
• The asymmetric version of Hoeffding's inequality (5.3.4) also holds for
martingales. So if each increment Xi satisfies ai ≤ Xi ≤ bi always,

Pr[Σ_{i=1}^n Xi ≥ t] ≤ exp(−2t²/Σ_{i=1}^n (bi − ai)²). (5.3.6)

• The same bound works for bounded-difference supermartingales. A


supermartingale is a process where E [Xi | X1 . . . Xi−1 ] ≤ 0; the idea is
that my expected gain at any step is non-positive, so my present wealth
is always superior to my future wealth.11 If E [Xi | X1 . . . Xi−1 ] ≤ 0 and
|Xi | ≤ ci , then we can write Xi = Yi +Zi where Yi = E [Xi | X1 . . . Xi−1 ] ≤
0 is predictable from X1 . . . Xi−1 and E [Zi | X1 . . . Xi−1 ] = 0.12 Then
we can bound Σ_{i=1}^n Xi by observing that it is no greater than Σ_{i=1}^n Zi.

11
The corresponding notion in the other direction is a submartingale. See §9.2.
12
This is known as a Doob decomposition and can be used to extract a martingale
{Zi } from any stochastic process {Xi }. For general processes, Yi = Xi − Zi will still be
predictable, but may not satisfy E [Yi | X1 , . . . , Xi−1 ] ≤ 0.

A complication is that we no longer have |Zi | ≤ ci ; instead, |Zi | ≤ 2ci


(since leaving out Yi may shift Zi up). But with this revised bound,
(5.3.5) gives
" n # " n #
X X
Pr Xi ≥ t ≤ Pr Zi ≥ t
i=1 i=1
!
t2
≤ exp − Pn 2 . (5.3.7)
8 i=1 ci

• Suppose that we stop the process after the first time τ with Sτ =
Σ_{i=1}^τ Xi ≥ t. This is equivalent to making a new variable Yi that is
zero whenever Si−1 ≥ t and equal to Xi otherwise. This doesn't affect
the conditions E[Yi | Y1 . . . Yi−1] = 0 or |Yi| ≤ ci, but it makes it so
Σ_{i=1}^n Yi ≥ t if and only if max_{k≤n} Σ_{i=1}^k Xi ≥ t. Applying (5.3.5) to
Σ Yi then gives

Pr[max_{k≤n} Σ_{i=1}^k Xi ≥ t] ≤ exp(−t²/(2 Σ_{i=1}^n ci²)). (5.3.8)

• Since the conditions on Xi in Theorem 5.3.3 apply equally well to −Xi ,


we have
" n # !
X t2
Pr Xi ≤ −t ≤ exp − Pn 2 . (5.3.9)
i=1
2 i=1 ci

which we can combine with (5.3.5) to get the two-sided bound


" n # !
X t2
Pr Xi ≥ t ≤ 2 exp − Pn 2 . (5.3.10)
i=1
2 i=1 ci

• The extension of Hoeffding’s inequality to the case ai ≤ Xi ≤ bi works


equally well for Azuma’s inequality, giving the same bound as in (5.3.4).

• Finally, one can replace the requirement that each ci be a constant


with a requirement that ci be predictable from X1 . . . Xi−1 and that
Σ_{i=1}^n ci² ≤ C always, and get Pr[Σ_{i=1}^n Xi ≥ t] ≤ e^{−t²/2C}. This generally
doesn't come up unless you have an algorithm that explicitly cuts off
the process if Σ ci² gets too big, but there is at least one example of
this in the literature [AW96].



There are also cases where the asymmetric version works with ai ≤
Xi ≤ bi where a bound on bi − ai is fixed but the precise values of
ai and bi may vary depending on X1 , . . . , Xi−1 . This shows up in the
proof of McDiarmid’s inequality [McD89], which is described below in
§5.3.3.

5.3.3 The method of bounded differences


To use Azuma’s inequality, we need a bounded-difference martingale. The
easiest way to get such martingales is through the method of bounded dif-
ferences, which was popularized by a survey paper by McDiarmid [McD89].
For this reason the key result is often referred to as McDiarmid’s inequal-
ity.
The basic idea of the method is to structure a problem so that we are
computing a function f (X1 , . . . , Xn ) of a sequence of independent random
variables X1 , . . . , Xn . To get our martingale, we’ll imagine we reveal the Xi
one at a time, and compute at each step the expectation of the final value of
f based on just the inputs we’ve seen so far.
Formally, let Ft = ⟨X1, . . . , Xt⟩, the σ-algebra generated by X1 through
Xt. This represents all the information we have at time t. Let Yt = E[f | Ft],
the expected value of f given the values of the first t variables. Then {⟨Yt, Ft⟩}
forms a martingale, with Y0 = E[f] and Yn = E[f | X1, . . . , Xn] = f. So if
we can find a bound ct on |Yt − Yt−1|, we can apply Azuma's inequality to
get bounds on Yn − Y0 = f − E[f].
A sequence of random variables of the form Yt = E[Z | Ft], where Z is
some fixed random variable and F0 ⊆ F1 ⊆ F2 ⊆ . . . is a filtration, is called
a Doob martingale, and this is one of the most common ways to construct
a martingale. The proof that a Doob martingale is in fact a martingale
is immediate from the general version of the law of iterated expectation
E [Yt+1 | Ft ] = E [E [Z | Ft+1 ] | Ft ] = E [Z | Ft ] = Yt . Not all martingales are
Doob martingales: for example, the martingale whose difference sequence
consists of fair ±1 coin-flips doesn’t converge to any random variable in the
limit.13
To show that Yt meets the conditions for Azuma’s inequality, we require
13
A question came up in class whether martingales that converge at some finite time
n are technically Doob martingales with respect to Ft = hY0 , Y1 , . . . Yt i. This is sort of
true, since we have that E [Yn | Ft ] = Yt for all t. However this doesn’t help much for
constructing a particular martingale. If we let Yt = 0 for all t < k and Yt = Yk = ±1 for
all t ≥ k, then for any n ≥ k, ({hYt , hY0 , . . . , Yt ii}) is a Doob martingale generated by Yn .
But this doesn’t tell us anything about which Yk does the coin-flip.

that f has the bounded difference property, which says that there are
bounds ct such that for any x1 . . . xn and any x′t, we have

|f(x1 . . . xt . . . xn) − f(x1 . . . x′t . . . xn)| ≤ ct. (5.3.11)

We want to bound |Yt+1 − Yt| = |E[f | Ft+1] − E[f | Ft]|. We can do
this by showing Yt+1 − Yt ≤ ct+1. Because f can be replaced by −f without
changing the bounded difference property, essentially the same argument
will show Yt+1 − Yt ≥ −ct+1.
Fix some possible value xt+1 for Xt+1 . The bounded difference property
says that

|f(X1, . . . , Xt, xt+1, Xt+2, . . . , Xn) − f(X1, . . . , Xt, Xt+1, Xt+2, . . . , Xn)| ≤ ct+1,

so

|E[f(X1, . . . , Xt, xt+1, Xt+2, . . . , Xn) | X1, . . . , Xt]
− E[f(X1, . . . , Xt, Xt+1, Xt+2, . . . , Xn) | X1, . . . , Xt]| ≤ ct+1. (5.3.12)

The second conditional expectation is just Yt. What is the first one?
Recall

Yt+1 = E[f(X1, . . . , Xn) | X1, . . . , Xt+1]
= E[f(x1, . . . , xt+1, Xt+2, . . . , Xn)]

when Xi = xi for all i ≤ t + 1, since conditioning on X1, . . . , Xt+1 just
replaces the random variables with their actual values and then averages
over the rest. This is the same as

E[f(X1, . . . , Xt, xt+1, Xt+2, . . . , Xn) | X1, . . . , Xt] = E[f(x1, . . . , xt+1, Xt+2, . . . , Xn)]

when xt+1 happens to be the value of Xt+1. So the first conditional expectation
in (5.3.12) is just Yt+1, giving

|Yt+1 − Yt| ≤ ct+1.

Now we can apply Azuma-Hoeffding to get


Pr[Yn − Y0 ≥ t] ≤ exp(−t²/(2 Σ_{i=1}^n ci²)).

This turns out to overestimate the possible range of Yt+1 . With a more
sophisticated argument, it can be shown that for any fixed x1 , . . . , xt , there
exist bounds at+1 ≤ Yt+1 − Yt ≤ bt+1 such that bt+1 − at+1 = ct+1 . We
would like to use this to apply the asymmetric version of Azuma-Hoeffding
given in (5.3.6). A complication is that the specific values of at+1 and bt+1
may depend on the previous values x1 , . . . , xt , even if the bound ct+1 on
their maximum difference does not. Fortunately, McDiarmid shows that the
inequality works anyway, giving:

Theorem 5.3.4 (McDiarmid’s inequality [McD89]). Let X1 , . . . , Xn be


independent random variables and let f (X1 , . . . , Xn ) have the bounded differ-
ence property with bounds ci . Then
Pr[f(X1, . . . , Xn) − E[f(X1, . . . , Xn)] ≥ t] ≤ exp(−2t²/Σ_{i=1}^n ci²). (5.3.13)

The main difference between this and a direct application of Azuma-


Hoeffding is that the constant factor in the exponent is better by a factor of
4.
Since −f satisfies the same constraints as f , the same bound holds for
Pr [f (X1 , . . . , Xn ) − E [f (X1 , . . . , Xn )] ≤ −t]. For some applications it may
make sense to apply the union bound to get a two-sided version
Pr[|f(X1, . . . , Xn) − E[f(X1, . . . , Xn)]| ≥ t] ≤ 2 exp(−2t²/Σ_{i=1}^n ci²). (5.3.14)

5.3.4 Applications
Here are some applications of the preceding inequalities. Most of these are
examples of the method of bounded differences.

5.3.4.1 Sprinkling points on a hypercube


Suppose you live in a hypercube, and the local government has conveniently
placed mailboxes on some subset A of the nodes. If you start at a random

location, how likely is it that your distance to the nearest mailbox deviates
substantially from the average distance?
We can describe your position as a bit vector X1 , . . . , Xn , where each
Xi is an independent random bit. Let f (X1 , . . . , Xn ) be the distance from
X1 , . . . , Xn to the nearest element of A. Then changing one of the bits
changes this function by at most 1. So we have Pr[|f − E[f]| ≥ t] ≤ 2e^{−2t²/n}
by (5.3.13), giving a range of possible distances that is O(√(n log n)) with
probability at least 1 − n^{−c} for any fixed c > 0.14 Of course, without knowing
what A is, we don’t know what E[f ] is; but at least we can be assured that
(unless A is very big) the distance we have to walk to send our mail will be
pretty much the same pretty much wherever we start.

5.3.4.2 Chromatic number of a random graph


Consider a random graph G(n, p) consisting of n vertices, where each pos-
sible edge appears with independent probability p. Let χ be the chromatic
number of this graph, the minimum number of colors necessary if we want
to assign a color to each vertex that is distinct for the colors of all of its
neighbors. The vertex exposure martingale shows us the vertices of the
graph one at a time, along with all the edges between vertices that have
been exposed so far. We define Xt to be the expected value of χ given this
information for vertices 1 . . . t.
If Zi is a random variable describing which edges are present between i
and vertices less than i, then the Zi are all independent, and we can write
χ = f (Z1 , . . . , Zn ) for some function f (this function may not be very easy to
compute, but it exists). Then Xt as defined above is just E [f | Z1 , . . . , Zt ].
Now observe that f has the bounded difference property with ct = 1: if I
change the edges for some vertex vt , I can’t increase the number of colors
I need by more than 1, since in the worst case I can always take whatever
coloring I previously had for all the other vertices and add a new color for
vt . This implies that the difference between any two graphs G and G0 that
differ only in the value of some Zt is at most one, because going between
them is an increase in one direction or the other.
McDiarmid's inequality (5.3.13) then says that Pr[|χ − E[χ]| ≥ t] ≤
2e^{−2t²/n}; in other words, the chromatic number of a random graph is tightly
concentrated around its mean, even if we don't know what that mean is.
This proof is due to Shamir and Spencer [SS87]. Much better bounds are
known on the expected value and distribution of χ(G(n, p)) for many values
14
Proof: Let t = √((1/2)(c + 1)n ln n) = O(√(n log n)). Then 2e^{−2t²/n} = 2e^{−(c+1) ln n} =
2n^{−c−1} < n^{−c} when n is sufficiently large.

of n and p than are given by this crude result. For a more recent paper that
cites many of these see [Hec21].

5.3.4.3 Balls in bins


Suppose we toss m balls into n bins. How many empty bins do we get?
The probability that each bin individually is empty is exactly (1 − 1/n)m ,
which is approximately e−m/n when n is large. So the expected number of
empty bins is exactly n(1 − 1/n)m . If we let Xi be the bin that ball i gets
tossed into, and let Y = f (X1 , . . . , Xm ) be the number of empty bins, then
changing a single Xi can change f by at most 1. So from (5.3.13) we have
Pr[Y ≥ n(1 − 1/n)^m + t] ≤ e^{−2t²/m}.
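A quick experiment (an illustrative sketch) matches the expectation formula and shows the concentration:

import random

def empty_bins(m, n):
    """Number of bins left empty after tossing m balls into n bins."""
    return n - len({random.randrange(n) for _ in range(m)})

m = n = 10_000
expected = n * (1 - 1 / n) ** m
samples = [empty_bins(m, n) for _ in range(100)]
print(expected, min(samples), max(samples))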

5.3.4.4 Probabilistic recurrence relations


Most probabilistic recurrence arguments (as in Appendix I) can be interpreted
as supermartingales: the current estimate of T (n) is always greater than or
equal to the expected estimate after doing one stage of the recurrence. This
fact can be used to get concentration bounds using (5.3.7).
For example, let’s take the recurrence (1.3.1) for the expected number of
comparisons for QuickSort:

T(n) = (n − 1) + (1/n) Σ_{k=0}^{n−1} (T(k) + T(n − 1 − k)).

We showed in §1.3.1 that the solution to this recurrence satisfies T (n) ≤


2n ln n.
To turn this into a supermartingale, imagine that we carry out a process
where we keep around at each step t a set of unsorted blocks of size
n_1^t, n_2^t, . . . , n_{k_t}^t for some k_t (note that the superscripts on n_i^t are not exponents).
One step of the process involves choosing one of the blocks (we can
do this arbitrarily without affecting the argument) and then splitting that
block around a uniformly-chosen pivot. We will track a random variable X_t
equal to C_t + Σ_{i=1}^{k_t} 2 n_i^t ln n_i^t, where C_t is the number of comparisons done
so far and the summation gives an upper bound on the expected number of
comparisons remaining.
To show that this is in fact a supermartingale, observe that if we partition
a block of size n we add n − 1 to Ct but replace the cost bound 2n ln n by

an expected

2 · (1/n) Σ_{k=0}^{n−1} 2k ln k ≤ (4/n) ∫_2^n k ln k dk
= (4/n)(n² ln n/2 − n²/4 − 2 ln 2 + 1)
= 2n ln n − n − (8 ln 2 − 4)/n
< 2n ln n − n.

The net change is less than − ln 2. The fact that it’s not zero suggests
that we could improve the 2n ln n bound slightly, but since it’s going down,
we have a supermartingale.
Let's try to get a bound on how much Xt changes at each step. The Ct
part goes up by at most n − 1. The summation can only go down; if we split
a block of size n, the biggest drop we get is if we split it evenly.15 This gives
a drop of

2n ln n − 2 · 2 · ((n − 1)/2) ln((n − 1)/2) = 2n ln n − 2(n − 1) ln((n − 1)/2)
= 2n ln n − 2(n − 1)(ln n − ln(2n/(n − 1)))
= 2n ln n − 2n ln n + 2n ln(2n/(n − 1)) + 2 ln n − 2 ln(2n/(n − 1))
= 2n · O(1) + O(log n)
= O(n).

(with a constant tending to 2 in the limit).


So we can apply (5.3.7) with ct = O(n) to the at most n steps of the
algorithm, and get
Pr[Cn − 2n ln n ≥ t] ≤ e^{−t²/O(n³)}.

This gives Cn = O(n3/2 ) with constant probability or O(n3/2 log n) with
all but polynomial probability. This is a rather terrible bound, but it’s much
better than O(n2 ).
Much tighter bounds are known: QuickSort in fact uses Θ(n log n) com-
parisons with high probability [MH92]. If we aren’t too worried about
constants, an easy way to see the upper bound side of this is to adapt the
15
This can be proven most easily using convexity of n ln n.

analysis of Hoare’s FIND (§3.6.4). For each element, the number of elements
in the same block is multiplied by a factor of at most 3/4 on average each
time the element is compared, so the chance that the element is not by itself is
at most (3/4)^k n after k comparisons. Setting k = log_{4/3}(n²/ε) gives that any
particular element is compared k or more times with probability at most ε/n. The
union bound then gives a probability of at most ε that the most-compared
element is compared k or more times. So the total number of comparisons is
O(n log(n/ε)) with probability 1 − ε, which becomes O(n log n) with probability
1 − n^{−c} if we set ε = n^{−c} for a fixed c.
We can formalize this argument itself using a supermartingale. Let C_i^t
be the number of times i has been compared as a non-pivot in the first t
pivot steps and N_i^t be the size of the block containing i after t pivot steps.
Let Y_i^t = (4/3)^{C_i^t} N_i^t. Then if we pick i's block at step t + 1, the exponent
goes up by at most 1 and N_i^t drops by a factor of 3/4 on average, canceling out the
increase. If we don't pick i's block, Y_i^t is unchanged. In either case we get
Y_i^t ≥ E[Y_i^{t+1} | F_t], and Y_i^t is a supermartingale.
Now let Z^t = Σ_i Y_i^t. Since this is greater than or equal to Σ_i E[Y_i^{t+1} | F_t] =
E[Z^{t+1} | F_t], Z^t is also a supermartingale. For t = 0, Z^0 = n · (4/3)^0 · n = n².
For t = n, Z^n = Σ_i (4/3)^{C_i^n}. But then

Pr[max_i C_i^n ≥ a log_{4/3} n] = Pr[max_i (4/3)^{C_i^n} ≥ n^a]
≤ Pr[Σ_i (4/3)^{C_i^n} ≥ n^a]
= Pr[Z^n ≥ n^a]
≤ E[Z^n]/n^a
≤ Z^0/n^a
= n^{a−2}.

Choosing a = c + 2 gives an n^{−c} bound on Pr[max_i C_i^n ≥ (c + 2) log_{4/3} n]
and thus the same bound on Pr[Σ_i C_i^n ≥ (c + 2)n log_{4/3} n]. This shows that
the total number of comparisons is O(n log n) with high probability.
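For comparison, this sketch (illustrative code, not part of the proof) counts comparisons made by randomized QuickSort and checks that even the worst of several runs stays near 2n ln n:

import math
import random

def quicksort_comparisons(items):
    """Comparisons used by randomized QuickSort; every non-pivot is compared to the pivot."""
    if len(items) <= 1:
        return 0
    pivot = random.choice(items)
    less = [x for x in items if x < pivot]
    greater = [x for x in items if x > pivot]
    return len(items) - 1 + quicksort_comparisons(less) + quicksort_comparisons(greater)

n = 10_000
runs = [quicksort_comparisons(list(range(n))) for _ in range(20)]
print(max(runs), 2 * n * math.log(n))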

5.3.4.5 Multi-armed bandits


In the multi-armed bandit problem, we must choose at each time step
one of a fixed set of k arms to pull. Pulling arm i at time t yields a return

of Xit , a random payoff typically assumed to be between 0 and 1. Suppose


that all the Xit are independent, and that for each fixed i, all Xit have the
same distribution, and thus the same expected payoff. Suppose also that we
initially know nothing about these distributions. What strategy can we use
to maximize our expected payoff over a large number of pulls?
More specifically, we want to minimize our regret, defined as

Σ_i Ti · (µ* − µi), (5.3.15)

where Ti counts the number of times we pull arm i, where µ* is the expected
payoff of the best arm, and µi = E[Xi] is the expected payoff of arm i.
The tricky part here is that when we pull an arm and get a bad return,
we don’t know if we were just unlucky this time or it’s actually a bad arm.
So we have an incentive to try lots of different arms. On the other hand, the
more we pull a genuinely inferior arm, the worse our overall return. We’d like
to adopt a strategy that trades off between exploration (trying new arms)
and exploitation (collecting on the best arm so far) to do as best we can in
comparison to a strategy that always pulls the best arm.

The UCB1 algorithm Fortunately, there is a simple algorithm due to


Auer et al. [ACBF02] that solves this problem for us.16 To start with, we
pull each arm once. For any subsequent pull, suppose that for each i, we
have pulled the i-th arm ni times so far. Let n = Σ ni, and let xi be the
average payoff from arm i so far. Then the UCB1 algorithm pulls the arm
that maximizes

xi + √(2 ln n / ni). (5.3.16)
UCB stands for upper confidence bound. (The “1” is because it is
the first, and simplest, of several algorithms of this general structure given in
the paper.) The idea is that we give arms that we haven’t tried very much
the benefit of the doubt, and assume that their actual average payout lies at
the upper end of some plausible range of values.17
16
This is not the only algorithm for solving multi-armed bandit problems, and it’s not
even the only algorithm in the Auer et al. paper. But it has the advantage of being
relatively straightforward to analyze. For a more general survey of multi-armed bandit
algorithms, see [BCB12] or [LS20].
17
The “bound” part is because we don’t attempt to compute the exact upper end of
this confidence interval, which may be difficult, but instead use an upper bound derived
from Hoeffding’s inequality. This distinguishes the UCB1 algorithm of [ACBF02] from the
upper confidence interval approach of Lai and Robbins [LR85] that it builds on.
The quantity √(2 ln n / ni) is a bit mysterious, but it arises in a fairly natural
way from the asymmetric version of Hoeffding's inequality. With a small
adjustment to deal with non-zero-mean variables, (5.3.4) says that, if S is a
sum of n random variables bounded between ai and bi, then

Pr[Σ_{i=1}^n (Xi − E[Xi]) ≥ t] ≤ e^{−2t²/Σ_{i=1}^n (bi − ai)²}. (5.3.17)

By applying (5.3.4) to −Xi, we also get

Pr[Σ_{i=1}^n (Xi − E[Xi]) ≤ −t] ≤ e^{−2t²/Σ_{i=1}^n (bi − ai)²}. (5.3.18)

Now consider xi = (1/ni) Σ_{j=1}^{ni} Xij where each Xij lies between 0 and 1. This
is equivalent to having xi = Σ_{j=1}^{ni} Yj where Yj = Xij/ni lies between 0 and
1/ni. So (5.3.17) says that

Pr[xi − E[xi] ≥ √(2 ln n / ni)] ≤ e^{−2(√((2 ln n)/ni))²/(ni (1/ni)²)}
= e^{−4 ln n}
= n^{−4}. (5.3.19)

We also get a similar lower bound using (5.3.18).


So the bonus applied to xi is really a high-probability bound on how big
the difference between the observed payoff and expected payoff might be.
The ln n part is there to make the error probability be small as a function
of n, since we will be summing over a number of bad cases polynomial in n
and not a particular ni. Applying the bonus to all arms makes it likely that
the observed payoff of the best arm stays above its actual payoff, so we won't
forget to pull it.
arms will not boost them up so much that we pull them any more than we
have to.
However, in an infinite execution of the algorithm, even a bad arm will
be pulled infinitely often, as ln n rises enough to compensate for the last
increase in ni . This accounts for an O(log n) term in the regret, as we will
see below. It also prevents us from getting stuck refusing to give a good arm
a second chance just because we had an initial run of bad luck.
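A compact implementation of UCB1 for Bernoulli arms looks like the following sketch (the arm means and horizon are invented for illustration); the printed regret should grow roughly logarithmically in the number of pulls.

import math
import random

def ucb1(means, pulls):
    """Run UCB1 on Bernoulli arms with the given means; return the realized regret."""
    k = len(means)
    counts, totals = [0] * k, [0.0] * k
    best, regret = max(means), 0.0
    for t in range(1, pulls + 1):
        if t <= k:
            i = t - 1   # initial round-robin: pull each arm once
        else:
            i = max(range(k),
                    key=lambda j: totals[j] / counts[j] + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if random.random() < means[i] else 0.0
        counts[i] += 1
        totals[i] += reward
        regret += best - means[i]
    return regret

print(ucb1([0.5, 0.55, 0.6], 100_000))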

Analysis of UCB1 The following theorem is a direct quote of [ACBF02,


Theorem 1]:

Theorem 5.3.5 ([ACBF02]). For all K > 1, if policy UCB1 is run on K


machines having arbitrary reward distributions P1 , . . . , PK with support in
[0, 1], then its expected regret after any number n of plays is at most

8 Σ_{i: µi < µ*} (ln n / ∆i) + (1 + π²/3) Σ_{j=1}^K ∆j. (5.3.20)

Here µi = E [Pi ] is the expected payoff for arm i, µ∗ as before is maxi µi ,


and ∆i = µ∗ − µi is the regret for pulling arm i. The theorem states that
our expected regret is bounded by a term for each arm worse than µ* that
grows logarithmically in n and is inversely proportional to how close the arm
is to optimal, plus a constant additional loss corresponding to pulling every
arm a little more than 4 times on average. The logarithmic regret in n is
a bit of a nuisance, but an earlier lower bound of Lai and Robbins [LR85]
shows that something like this is necessary in the limit.
To prove Theorem 5.3.5, we need to get an upper bound on the number
of times each suboptimal arm is pulled during the first n pulls. Define
c_{t,s} = √(2 ln t / s), (5.3.21)

the bonus given to an arm that has been pulled s times in the first t pulls.
Fix some optimal arm. Let X̄_{i,s} be the average return on arm i after s
pulls and X̄*_s be the average return on the optimal arm after s pulls.
If we pull arm i after t total pulls, when arm i has previously been pulled
s_i times and our optimal arm has been pulled s* times, then we must have

X̄_{i,s_i} + c_{t,s_i} ≥ X̄*_{s*} + c_{t,s*}. (5.3.22)

This just says that arm i with its bonus looks better than the optimal
arm with its bonus.
To show that this bad event doesn’t happen, we need three things:

1. The value of X s∗ +ct,s∗ should be at least µ∗ . Were it to be smaller, the

observed value X s∗ would be more than ct,s∗ away from its expectation.
Hoeffding’s inequality implies this doesn’t happen too often.

2. The value of X i,si + ct,si should not be too much bigger than µi . We’ll
again use Hoeffding’s inequality to show that X i,si is likely to be at
most µi + ct,si , making X i,si + ct,si at most µi + 2ct,si .

3. The bonus ct,si should be small enough that even adding 2ct,si to µi
is not enough to beat µ∗ . This means that we need to pick si large
enough that 2ct,si ≤ ∆i . For smaller values of si , we will just accept
that we need to pull arm i a few more times before we wise up.
More formally, if none of the following events hold:

X̄*_{s*} + c_{t,s*} ≤ µ*, (5.3.23)
X̄_{i,s_i} ≥ µi + c_{t,s_i}, (5.3.24)
µ* − µi < 2c_{t,s_i}, (5.3.25)

then X̄*_{s*} + c_{t,s*} > µ* ≥ µi + 2c_{t,s_i} > X̄_{i,s_i} + c_{t,s_i}, and we don't pull arm i
because the optimal arm is better. (We don’t necessarily pull the optimal
arm, but if we don’t, it’s because we pull some other arm that still isn’t arm
i.)
For (5.3.23) and (5.3.24), we repeat the argument in (5.3.19), plugging
in t for n and si or s∗ for ni . This gives a probability of at most 2t−4 that
either or both of these bad events occur.
For (5.3.25), we need to do something a little sneaky, because the statement
is not actually true when si is small. So we will give ℓi free pulls to
arm i, and only start comparing arm i to the optimal arm after we have
done at least this many pulls. The value of ℓi is chosen so that, when t ≤ n
and si > ℓi,

2c_{t,s_i} ≤ µ* − µi,

which expands to

2√(2 ln t / s_i) ≤ ∆i,

giving

s_i ≥ 8 ln t / ∆i².

So we must set ℓi to be at least 8 ln n/∆i² ≥ 8 ln t/∆i². Because ℓi must be an integer, we actually get

ℓi = ⌈8 ln n/∆i²⌉ ≤ 1 + 8 ln n/∆i².

This explains (after multiplying by the regret ∆i ) the first term in (5.3.20).
For the other sources of bad pulls, apply the union bound to the 2t−4
error probabilities we previously computed for all choices of t ≤ n, s∗ ≥ 1,
and si > `i . This gives
Σ_{t=1}^n Σ_{s*=1}^{t−1} Σ_{s_i=ℓ_i+1}^{t−1} 2t^{−4} < 2 Σ_{t=1}^∞ t² · t^{−4}
= 2 · π²/6
= π²/3.
Again we have to multiply by the regret ∆i for pulling the i-th arm, which
gives the second term in (5.3.20).

5.4 Relation to limit theorems


Since in many cases we are working with sums Sn = Σ_{i=1}^n Xi of independent,
identically distributed random variables Xi, classical limit theorems apply.
These relate the behavior of Sn in the limit to the common expectation
µ = E[Xi] and variance σ² = Var[Xi].18
These include the strong law of large numbers

Pr[lim_{n→∞} Sn/n = µ] = 1 (5.4.1)

and the central limit theorem (CLT)

lim_{n→∞} Pr[(Sn − µn)/(σ√n) ≤ t] = Φ(t), (5.4.2)

where Φ is the normal distribution function.


These can sometimes be useful in the analysis of randomized algorithms,
but often are not strong enough to get the results we want. The main problem
is that the standard versions of both the strong law and the central limit
theorem say nothing about rate of convergence. So if we want (for example)
to use the CLT to show that Sn is exponentially unlikely to be nσ away from
the mean, we can't do it directly, because nσ/(σ√n) = √n is not a fixed constant t,
18
The quantity σ = √(Var[X]) is called the standard deviation of X, and informally
gives a measure of the typical distance of X from its mean, measured in the same units as
X.

and for any fixed constant t, we don’t know when the limit behavior actually
starts working.
But there are variants of these theorems that do bound rate of convergence
that can be useful in some cases. An example is given in §5.5.1.

5.5 Anti-concentration bounds


It may be that for some problem you want to show that a sum of random
variables is far from its mean at least some of the time: this would be
an anti-concentration bound. Anti-concentration bounds are much less
well-understood than concentration bounds, but there are known results that
can help in some cases.
For variables where we know the distribution of the sum exactly (e.g.,
sums with binomial distributions, or sums we can attack with generating
functions), we don’t need these. But they may be useful if computing the
distribution of the sum directly is hard.

5.5.1 The Berry-Esseen theorem


The Berry-Esseen theorem19 is an extension of the central limit theorem
that characterizes how quickly a sum of independent identically-distributed
random variables converges to a normal distribution, as a function of the
third moment of the random variables. Its simplest version says that if
we have n independent, identically-distributed random variables X1 . . . Xn,
with E[Xi] = 0, Var[Xi] = E[Xi²] = σ², and E[|Xi|³] ≤ ρ, then

sup_{−∞<x<∞} |Pr[(1/(σ√n)) Σ_{i=1}^n Xi ≤ x] − Φ(x)| ≤ Cρ/(σ³√n), (5.5.1)

where C is an absolute constant and Φ is the normal distribution function. Note
that the σ³ in the denominator is really Var[Xi]^{3/2}. Since the probability
bound doesn't depend on x, it's more useful toward the middle of the
distribution than in the tails.
bound doesn’t depend on x, it’s more useful toward the middle of the
distribution than in the tails.
A classic proof of this result with C = 3 can be found in [Fel71, §XVI.5].
Shevtsova [She11] shows a stronger bound of C < 0.4784.
As with most results involving sums of random variables, there are
generalizations to martingales. These are too involved to describe here, but
see [HH80, §3.6].
19
Sometimes written Berry-Esséen theorem to help with the pronunciation of Esseen’s
last name.

5.5.2 The Littlewood-Offord problem


The Littlewood-Offord problem asks, given a set of n complex numbers
x1 . . . xn with |xi| ≥ 1, for how many assignments of ±1 to coefficients
ε1 . . . εn it holds that |Σ_{i=1}^n εi xi| ≤ r. Paul Erdős showed [Erd45] that this
quantity was at most cr2^n/√n, where c is a constant. The quantity c2^n/√n
here is really (1/2)(n choose ⌊n/2⌋): Erdős's proof shows that for each interval of length
2r, the number of assignments that give a sum in the interior of the interval
is bounded by at most the sum of the r largest binomial coefficients.
In random-variable terms, this means that if we are looking at Σ_{i=1}^n εi xi,
where the xi are constants with |xi| ≥ 1 and the εi are independent ±1 fair
coin-flips, then Pr[|Σ_{i=1}^n εi xi| ≤ r] is maximized by making all the xi equal
to 1. This shows that any distribution where the xi are all reasonably large
will not be any more concentrated than a binomial distribution.
There has been a lot of more recent work on variants of the Littlewood-
Offord problem, much of it by Terry Tao and Van Vu. See http://terrytao.
wordpress.com/2009/02/16/a-sharp-inverse-littlewood-offord-theorem/
for a summary of much of this work.
Chapter 6

Randomized search trees

These are data structures that are either trees or equivalent to trees, and use
randomization to maintain balance. We’ll start by reviewing deterministic
binary search trees and then add in the randomization.

6.1 Binary search trees


A binary search tree is a standard data structure for holding sorted data.
A binary tree is either empty, or it consists of a root node containing a
key and pointers to left and right subtrees. What makes a binary tree a
binary search tree is the invariant that, both for the tree as a whole and
any subtree, all keys in the left subtree are less than the key in the root,
while all keys in the right subtree are greater than the key in the root. This
ordering property means that we can search for a particular key by doing
binary search: if the key is not at the root, we can recurse into the left or
right subtree depending on whether it is smaller or bigger than the key at
the root.
The efficiency of this operation depends on the tree being balanced. If
each subtree always holds a constant fraction of the nodes in the tree, then
each recursive step throws away a constant fraction of the remaining nodes.
So after O(log n) steps, we find the key we are looking for (or find that
the key is not in the tree). But the definition of a binary search tree does
not by itself guarantee balance, and in the worst case a binary search tree
degenerates into a linked list with O(n) cost for all operations (see Figure 6.2).


[Figure 6.1: Tree rotations]

6.1.1 Rebalancing and rotations


Deterministic binary search tree implementations include sophisticated re-
balancing mechanisms to adjust the structure of the tree to preserve balance
as nodes are inserted or deleted. Typically this is done using rotations,
which are operations that change the position of a parent and a child while
preserving the left-to-right ordering of keys (see Figure 6.1).
Examples include AVL trees [AVL62], where the left and right subtrees
of any node have heights that differ by at most 1; red-black trees [GS78],
where a coloring scheme is used to maintain balance; and scapegoat
trees [GR93], where no information is stored at a node but part of the
tree is rebuilt from scratch whenever an operation takes too long. These all
give O(log n) cost per operation (amortized in the case of scapegoat trees),
and vary in how much work is needed in rebalancing. Both AVL trees and
red-black trees perform more rotations than randomized rebalancing does on
average.

6.2 Random insertions


Suppose we insert n keys into an initially-empty binary search tree in random
order with no rebalancing. This means that for each insertion, we follow the
same path that we would when searching for the key, and when we reach an
empty tree, we replace it with a tree consisting solely of the key at the root.1
Since we chose a random order, each element is equally likely to be the
root, and all the elements less than the root end up in the left subtree, while
all the elements greater than the root end up in the right subtree, where
they are further partitioned recursively. This is exactly what happens in
1
This is not the only way to generate a binary search tree at random. For example, we
could instead choose uniformly from the set of all Cn binary search trees with n nodes,
where Cn = (1/(n + 1)) (2n choose n) is the n-th Catalan number. For n ≥ 3, this gives a different
distribution that we don't care about.

[Figure 6.2: Balanced and unbalanced binary search trees]

randomized QuickSort (see §1.3.1), so the structure of the tree will exactly
mirror the structure of an execution of QuickSort. So, for example, we
can immediately observe from our previous analysis of QuickSort that the
total path length—the sum of the depths of the nodes—is Θ(n log n),
since the depth of each node is equal to 1 plus the number of comparisons it
participates in as a non-pivot, and (using the same argument as for Hoare’s
FIND in §3.6.4) that the height of the tree is O(log n) with high probability.2
When n is small, randomized binary search trees can look pretty scraggly.
Figure 6.3 shows a typical example.
The problem with this approach in general is that we don’t have any
guarantees that the input will be supplied in random order, and in the
worst case we end up with a linked list, giving O(n) worst-case cost for all
operations.
2
The argument for Hoare’s FIND is that any node has at most 3/4 of the descendants
of its parent on average; this gives for any node x that Pr [depth(x) > d] ≤ (3/4)d−1 n, or
a probability of at most n−c that depth(x) > 1 + (c + 1) log(n)/ log(4/3) ≈ 1 + 6.952 ln n
for c = 1, which we need to apply the union bound. The right answer for the actual height
of a randomly-generated search tree in the limit is 4.31107 ln n [Dev88] so this bound
is actually pretty close. The real height is still nearly a factor of three worse than for a
completely balanced tree, which has max depth bounded by 1 + lg n ≈ 1 + 1.44269 ln n.

[Figure 6.3: Binary search tree after inserting 5 1 7 3 4 6 2]

6.3 Treaps
The solution to bad inputs is the same as for QuickSort: instead of assuming
that the input is permuted randomly, we assign random priorities to each
element and organize the tree so that elements with higher priorities rise
to the top. The resulting structure is known as a treap [SA96], because it
satisfies the binary search tree property with respect to keys and the heap
property with respect to priorities.3
There’s an extensive page of information on treaps at http://faculty.washington.edu/aragon/treaps.html,
maintained by Cecilia Aragon, the co-inventor of treaps; they are also
discussed at length in [MR95, §8.2]. We’ll
give a brief description here.
To insert a new node in a treap, first walk down the tree according to
the key and insert the node as a new leaf. Then go back up fixing the heap
property by rotating the new element up until it reaches an ancestor with
the same or higher priority. (See Figure 6.4 for an example.) Deletion is the
reverse of insertion: rotate a node until it has 0 or 1 children (by swapping
with its higher-priority child at each step), and then prune it out, connecting
its child, if any, directly to its parent.
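Here is a short sketch of the insert operation just described, assuming priorities drawn as random floats (an illustration, not the notes’ own implementation):

import random

class TreapNode:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()   # higher priority rises toward the root
        self.left = self.right = None

def rotate_right(y):
    # The left child of y rises above y; left-to-right key order is preserved.
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(y):
    x = y.right
    y.right, x.left = x.left, y
    return x

def treap_insert(root, key):
    # Insert as in an ordinary BST, then rotate up until the heap property holds.
    if root is None:
        return TreapNode(key)
    if key < root.key:
        root.left = treap_insert(root.left, key)
        if root.left.priority > root.priority:
            root = rotate_right(root)
    elif key > root.key:
        root.right = treap_insert(root.right, key)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root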
Because of the heap property, the root of each subtree is always the
element in that subtree with the highest priority. This means that the
structure of a treap is completely determined by the priorities and the keys,
no matter what order the elements arrive in. We can imagine in retrospect
3
The name “treap” for this data structure is now standard but the history is a little
tricky. According to Seidel and Aragon, essentially the same data structure (though with
non-random priorities) was previously called a cartesian tree by Vuillemin [Vui80], and
the word “treap” was initially applied by McCreight to a different data structure—designed
for storing two-dimensional data—that was called a priority search tree in its published
form. [McC85].

1,60            1,60
    \               \
    2,3     -->     3,26
       \            /
      (3,26)      2,3

1,60          1,60            1,60                 5,78
    \             \               \                /
    3,26   -->    3,26    -->     (5,78)   -->  1,60
    /  \          /  \            /                  \
 2,3    4,24   2,3    (5,78)   3,26                  3,26
           \          /        /  \                  /  \
         (5,78)    4,24      2,3   4,24           2,3    4,24

   5,78                  5,78
   /  \                  /  \
1,60   6,18     -->   1,60   7,41
    \      \              \    /
    3,26   (7,41)         3,26 6,18
    /  \                  /  \
 2,3    4,24           2,3    4,24

Figure 6.4: Inserting values into a treap. Each node is labeled with k, p where
k is the key and p the priority. Insertions of values not requiring rotations
are not shown.

that the treap is constructed recursively by choosing the highest-priority


element as the root, then organizing all smaller-index and all larger-index
nodes into the left and right subtrees by the same rule.
If we assign the priorities independently and uniformly at random from a
sufficiently large set (ω(n^2) is enough in the limit), then we get no duplicates,
and by symmetry all n! orderings are equally likely. So the analysis of the
depth of a treap with random priorities is identical to the analysis of a binary
search tree with random insertion order. It’s not hard to see that the costs
of search, insertion, and deletion operations are all linear in the depth of the
tree, so the expected cost of each of these operations is O(log n).

6.3.1 Assumption of an oblivious adversary


One caveat is that this only works if the priorities of the elements of the tree
are in fact independent. If operations on the tree are chosen by an adaptive
adversary, this assumption may fail to hold. An adaptive adversary is one
that can observe the choices made by the algorithm and react to them: in this
case, a simple strategy would be to insert elements 1, 2, 3, 4, etc., in order,
deleting each one and reinserting it until it has a lower priority value than all
the smaller elements. This might take a while for the later elements, but the
end result is the linked list again. For this reason it is standard to assume in
randomized data structures that the adversary is oblivious, meaning that
it has to specify a sequence of operations without knowing what choices are
made by the algorithm. Under this assumption, whatever insert or delete
operations the adversary chooses, at the end of any particular sequence of
operations we still have independent priorities on all the remaining elements,
and the O(log n) analysis goes through.

6.3.2 Analysis
The analysis of treaps as carried out by Seidel and Aragon [SA96] is a nice
example of how to decompose a messy process into simple variables, much
like the linearity-of-expectation argument for QuickSort (§1.3.2). The key
observation is that it’s possible to bound both the expected depth of any node
and the number of rotations needed for an insert or delete operation directly
from information about the ancestor-descendant relationship between nodes.
Define two classes of indicator variables. For simplicity, we assume that
the elements have keys 1 through n, which we also use as indices.

1. A_{i,j} indicates the event that i is an ancestor of j, where i is an ancestor
   of j if it appears on the path from the root to j. Note that every node
   is an ancestor of itself.

2. C_{i;ℓ,m} indicates the event that i is a common ancestor of both ℓ and
   m; formally, C_{i;ℓ,m} = A_{i,ℓ} A_{i,m}.

The nice thing about these indicator variables is that it’s easy to compute
their expectations.
For Ai,j , i will be the ancestor of j if and only if i has a higher priority
than j and there is no k between i and j that has an even higher prior-
ity: in other words, if i has the highest priority of all keys in the interval
[min(i, j), max(i, j)]. To see this, imagine that we are constructing the treap
recursively, by starting with all elements in a single interval and partitioning
each interval by its highest-priority element. Consider the last interval in
this process that contains both i and j, and suppose i < j (the j > i case is
symmetric). If the highest-priority element is some k with i < k < j, then i
and j are separated into distinct intervals and neither is the ancestor of the
other. If the highest-priority element is j, then j becomes the ancestor of i.
The highest-priority element can’t be less than i or greater than j, because
then we get a smaller interval that contains both i and j. So the only case
where i becomes an ancestor of j is when i has the highest priority.
It follows that E[A_{i,j}] = 1/(|i−j|+1), where the denominator is just the
number of elements in the range [min(i, j), max(i, j)].
For C_{i;ℓ,m}, i is the common ancestor of both ℓ and m if and only if it has
the highest priority in both [min(i, ℓ), max(i, ℓ)] and [min(i, m), max(i, m)].
It turns out that no matter what order i, ℓ, and m come in, these intervals
overlap, so that i must have the highest priority in [min(i, ℓ, m), max(i, ℓ, m)].
This gives E[C_{i;ℓ,m}] = 1/(max(i, ℓ, m) − min(i, ℓ, m) + 1).

6.3.2.1 Searches
From the A_{i,j} variables we can compute depth(j) = \sum_i A_{i,j} − 1.⁴ So

    E[depth(j)] = (\sum_{i=1}^{n} 1/(|i − j| + 1)) − 1
                = (\sum_{i=1}^{j} 1/(j − i + 1) + \sum_{i=j+1}^{n} 1/(i − j + 1)) − 1
                = (\sum_{k=1}^{j} 1/k + \sum_{k=2}^{n−j+1} 1/k) − 1
                = H_j + H_{n−j+1} − 2.

This is maximized at j = (n + 1)/2, giving 2H_{(n+1)/2} − 2 = 2 ln n + O(1).


So we get the same 2 ln n + O(1) bound on the expected depth of any one
node that we got for QuickSort. We can also sum over all j to get the exact
value of the expected total path length (but we won’t). These quantities
bound the expected cost of searches.

6.3.2.2 Insertions and deletions


For insertions and deletions, the question is how many rotations we have to
perform to float a new leaf up to its proper location (after an insertion) or
to float a deleted node down to a leaf (before a deletion). Since insertion is
just the reverse of deletion, we can get a bound on both by concentrating on
deletion. The trick is to find some metric for each node that (a) bounds the
number of rotations needed to move a node to the bottom of the tree and
(b) is easy to compute based on the A and C variables.
The left spine of a subtree is the set of all nodes obtained by starting
at the root and following left pointers; similarly the right spine is what we
get if we follow the right pointers instead.
When we rotate an element down, we are rotating either its left or right
child up. This removes one element from either the right spine of the left
subtree or the left spine of the right subtree, but the rest of the spines are
left intact (see Figure 6.5). Subsequent rotations will eventually remove all
these elements by rotating them above the target, while other elements in
4
We need the −1 because of the convention that the root has depth 0, making the
depth of a node one less than the number of its ancestors. Equivalently, we could exclude
j from the sum and count only proper ancestors.

      4                     2                      6
     / \                   / \                    / \
   *2   6*       =>       1   4        or        4   7
   / \  / \                  / \                / \
  1 *3 5*  7               *3   6*            *2   5*
                                / \           / \
                               5*  7         1  *3

Figure 6.5: Rotating 4 right shortens the right spine of its left subtree by
removing 2; rotating left shortens the left spine of the right subtree by
removing 6.

the subtree will be carried out from under the target without ever appearing
as a child or parent of the target. Because each rotation removes exactly one
element from one or the other of the two spines, and we finish when both
are empty, the sum of the length of the spines gives the number of rotations.
To calculate the length of the right spine of the left subtree of some
element ℓ, start with the predecessor ℓ − 1 of ℓ. Because there is no element
between them, either ℓ − 1 is a descendant of ℓ or an ancestor of ℓ. In the
former case (for example, when ℓ is 4 in Figure 6.5), we want to include all
ancestors of ℓ − 1 up to ℓ itself. Starting with \sum_i A_{i,ℓ−1} gets all the ancestors
of ℓ − 1, and subtracting off \sum_i C_{i;ℓ−1,ℓ} removes any common ancestors of
ℓ − 1 and ℓ. Alternatively, if ℓ − 1 is an ancestor of ℓ, every ancestor of
ℓ − 1 is also an ancestor of ℓ, so the same expression \sum_i A_{i,ℓ−1} − \sum_i C_{i;ℓ−1,ℓ}
evaluates to zero.
It follows that the expected length of the right spine of the left subtree is
exactly

    E[\sum_{i=1}^{n} A_{i,ℓ−1} − \sum_{i=1}^{n} C_{i;ℓ−1,ℓ}]
        = \sum_{i=1}^{n} 1/(|i − (ℓ − 1)| + 1) − \sum_{i=1}^{n} 1/(max(i, ℓ) − min(i, ℓ − 1) + 1)
        = \sum_{i=1}^{ℓ−1} 1/(ℓ − i) + \sum_{i=ℓ}^{n} 1/(i − ℓ + 2) − \sum_{i=1}^{ℓ−1} 1/(ℓ − i + 1) − \sum_{i=ℓ}^{n} 1/(i − (ℓ − 1) + 1)
        = \sum_{j=1}^{ℓ−1} 1/j + \sum_{j=2}^{n−ℓ+2} 1/j − \sum_{j=2}^{ℓ} 1/j − \sum_{j=2}^{n−ℓ+2} 1/j
        = 1 − 1/ℓ.

By symmetry, the expected length of the left spine of the right subtree is
1 − 1/(n − ℓ + 1). So the total expected number of rotations needed to delete the

[Figure 6.6: A skip list. The blue search path for 99 is superimposed on an
original image from [AS07]. Between the HEAD and TAIL sentinels, level 0
holds 13, 21, 33, 48, 75, 99; level 1 holds 13, 33, 48; level 2 holds only 33.]

ℓ-th element is

    2 − 1/ℓ − 1/(n − ℓ + 1) ≤ 2.

6.3.2.3 Other operations


Treaps support some other useful operations: for example, we can split a
treap into two treaps consisting of all elements less than and all elements
greater than a chosen pivot by rotating the pivot to the root (O(log n)
rotations on average, equal to the pivot’s expected depth as calculated in
§6.3.2.1) and splitting off the left and right subtrees. Merging two treaps with
non-overlapping keys is the reverse of this and so it has the same expected
complexity.

6.4 Skip lists


A skip list [Pug90] is a randomized tree-like data structure based on linked
lists. It consists of a level 0 list that is an ordinary sorted linked list, together
with higher-level lists that contain a random sampling of the elements at
lower levels. When inserted into the level i list, an element flips a coin that
tells it with probability p to insert itself in the level i + 1 list as well. The
result is that the element is represented by a tower of nodes, one in each
of the bottom 1 + X many layers, where X is a geometrically-distributed
random variable. An example of a small skip list is shown in Figure 6.6.
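The following sketch (an illustration under the assumptions above, not Pugh’s code) shows both the coin-flipping insertion and the top-down search; p is the promotion probability.

import random

class _Node:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height        # one forward pointer per level

class SkipList:
    def __init__(self, p=0.5, max_level=32):
        self.p = p
        self.head = _Node(None, max_level)    # sentinel preceding every key

    def _random_height(self):
        # 1 + Geometric(p): keep flipping a coin to decide whether to go up a level.
        h = 1
        while h < len(self.head.forward) and random.random() < self.p:
            h += 1
        return h

    def search(self, key):
        node = self.head
        for lvl in reversed(range(len(self.head.forward))):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]      # move right, then drop down a level
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * len(self.head.forward)
        node = self.head
        for lvl in reversed(range(len(self.head.forward))):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node                # last node before key at this level
        new = _Node(key, self._random_height())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new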
Searches in a skip list are done by starting in the highest-level list and
searching forward for the last node whose key is smaller than the target; the
search then continues in the same way on the next level down. To bound the
expected running time of a search, it helps to look at this process backwards;
the reversed search path starts at level 0 and continues going backwards
until it reaches the first element that is also in a higher level; it then jumps
to the next level up and repeats the process. The nice thing about this
reversed process is that it has a simple recursive structure: if we restrict a
skip list to only those nodes to the left of and at the same level or higher
of a particular node, we again get a skip list. Furthermore, the structure of
this restricted skip list depends only on coin-flips taken at nodes within it,
so it’s independent of anything that happens elsewhere in the full skip list.
We can analyze this process by tracking the number of nodes in the
restricted skip list described above, which is just the number of nodes in the
current level that are earlier than the current node. If we move left, this
drops by 1; if up, this drops to p times its previous value on average. So the
number of such nodes X_k after k steps satisfies the probabilistic recurrence

    E[X_{k+1} | X_k] = (1 − p)(X_k − 1) + p(pX_k)
                     = (1 − p + p^2)X_k − (1 − p)
                     ≤ (1 − p + p^2)X_k,

with X_0 = n − 1 (since the last node is not included). Wrapping expectations
around both sides gives E[X_{k+1}] = E[E[X_{k+1} | X_k]] ≤ (1 − p + p^2) E[X_k],
and in general we get E[X_k] ≤ (1 − p + p^2)^k X_0 < (1 − p + p^2)^k n. This is
minimized at p = 1/2, giving E[X_k] ≤ (3/4)^k n, suspiciously similar to the
bound we computed before for random binary search trees.
When X_k = 0, our search is done, so if T is the time to search, we have
Pr[T ≥ k] = Pr[X_k ≥ 1] ≤ (3/4)^k n, by Markov’s inequality. In particular,
if we want to guarantee that we finish with probability 1 − ε, we need to
run for log_{4/3}(n/ε) steps. This translates into an O(log n) bound on the
expected search time, and the constant is even the same as our (somewhat
loose) bound for treaps.
The space per element of a skip list also depends on p. Every element
needs one pointer for each level it appears in. The number of levels each
element appears in is a geometric random variable where we are waiting for
an event of probability 1 − p, so the expected number of pointers is 1/(1 − p).
For constant p this is O(1). However, the space cost can be reduced (at the
cost of increasing search time) by adjusting p. For example, if space is at a
premium, setting p = 1/10 produces 10/9 pointers per node on average—not
much more than in a linked list—but still gives O(log n) search time.
In general the trade-off is between n · 1/(1 − p) total expected pointers and
log_{1/(1−p+p^2)}(n/ε) search time. For small p, the number of pointers scales as
1 + O(p), while the constant factor in the search time is −1/log(1 − p + p^2) = O(1/p).

Skip lists are even easier to split or merge than treaps. It’s enough to
cut (or recreate) all the pointers crossing the boundary, without changing
the structure of the rest of the list.
Chapter 7

Hashing

The basic idea of hashing is that we have keys from a large set U , and we’d
like to pack them in a small set M by passing them through some function
h : U → M , without getting too many collisions, pairs of distinct keys x
and y with h(x) = h(y). Where randomization comes in is that we want this
to be true even if the adversary picks the keys to hash. At one extreme, we
could use a random function, but this will take a lot of space to store.1 So
our goal will be to find functions with succinct descriptions that are still
random enough to do what we want.
The presentation in this chapter is based largely on [MR95, §§8.4-8.5]
(which is in turn based on work of Carter and Wegman [CW77] on universal
hashing and Fredman, Komlós, and Szemerédi [FKS84] on O(1) worst-case
hashing); on [PR04] and [Pag06] for cuckoo hashing; and [MU17, §5.5.3] for
Bloom filters.

7.1 Hash tables


Here we review hash tables, which are implementations of the dictionary
data type mapping keys to values. The central idea is to use a random-
looking function to map keys to small numeric indices. The first published
version of this is due to Dumey [Dum56].
Suppose we want to store n elements from a universe U in a table
with keys or indices drawn from an index space M of size m. Typically we
1
An easy counting argument shows that almost all functions from U to M take |U | log|M |
bits to represent, no matter how clever you are about choosing your representation. This
forms the basis for algorithmic information theory, which defines an object as random
if there is no way to reduce the number of bits used to express it.


assume U = [|U |] = {0 . . . |U | − 1} and M = [m] = {0 . . . m − 1}.


If |U | ≤ m, we can just use an array. Otherwise, we can map keys to
positions in the array using a hash function h : U → M . This necessarily
produces collisions: pairs (x, y) with h(x) = h(y), and any design of a
hash table must include some mechanism for handling keys that hash to
the same place. Typically this is a secondary data structure in each bin,
but we may also place excess values in some other place. Typical choices
for data structures are linked lists (separate chaining or just chaining)
or secondary hash tables (see §7.3 below). Alternatively, we can push
excess values into other positions in the same hash table according to some
deterministic rule (open addressing or probing) or a second hash function
(see §7.4).
For most of these techniques, the costs of insertions and searches will
depend on how likely it is that we get collisions. An adversary that knows
our hash function can always choose keys with the same hash value, but we
can avoid that by choosing our hash function randomly.2 Our ultimate goal
is to do each search in O(1 + n/m) expected time, which for n ≤ m will be
much better than the Θ(log n) time for pointer-based data structures like
balanced trees or skip lists. The quantity n/m is called the load factor of
the hash table and is often abbreviated as α.
All of this only works if we are working in a RAM (random-access machine
model), where we can access arbitrary memory locations in time O(1) and
similarly compute arithmetic operations on O(log|U |)-bit values in time O(1).
There is an argument that in reality any actual RAM machine requires either
Ω(log m) time to read one of m memory locations (routing costs) or, if one is
particularly pedantic, Ω(m1/3 ) time (speed of light + finite volume for each
location). We will ignore this argument.
We will try to be consistent in our use of variables to refer to the different
parameters of a hash table. Table 7.1 summarizes the meaning of these
variable names.

7.2 Universal hash families


A family of hash functions H is 2-universal if for any x ≠ y, Pr[h(x) = h(y)] ≤
1/m for a uniform random h ∈ H. It’s strongly 2-universal if for any
2
In practice, hardly anybody ever does this. Hash functions are instead chosen based
on fashion and occasional experiments, often with additional goals like cryptographic
security or speed. For cryptographic security, the SHA family is standard. For speed,
xxHash (see https://cyan4973.github.io/xxHash/) is probably the fastest widely-used
hash function with decent statistical properties.

U Universe of all keys


S⊆U Set of keys stored in the table
n = |S| Number of keys stored in the table
M Set of table positions
m = |M | Number of table positions
α = n/m Load factor

Table 7.1: Hash table parameters

x_1 ≠ x_2 ∈ U, y_1, y_2 ∈ M, Pr[h(x_1) = y_1 ∧ h(x_2) = y_2] = 1/m^2 for a uniform
random h ∈ H. Another way to describe strong 2-universality is that the val-
ues of the hash function are uniformly distributed and pairwise-independent.
For k > 2, k-universal usually means strongly k-universal: given
distinct x_1 . . . x_k, and any y_1 . . . y_k, Pr[∀i : h(x_i) = y_i] = m^{−k}. This is equiv-
alent to the h(x_i) values for distinct x_i and randomly-chosen h having a
uniform distribution and k-wise independence. It is possible to generalize
the weak version of 2-universality to get a weak version of k-universality
(Pr[h(x_i) are all equal] ≤ m^{−(k−1)}), but this generalization is not as useful
as strong k-universality.
To analyze universal hash families, it is helpful to have some notation for
counting collisions. We’ll mostly be doing counting rather than probabilities
because it saves carrying around a lot of denominators. Since we are assuming
uniform choices of h we can always get back probabilities by dividing by |H|.
The particular notation we will be using follows §8.4.1 of [MR95]; the original
paper of Carter and Wegman [CW77] uses essentially the same notation with
slightly different formatting.
Let δ(x, y, h) = 1 if x ≠ y and h(x) = h(y), and 0 otherwise. Abusing notation,
we also define, for sets X, Y, and H, δ(X, Y, H) = \sum_{x∈X, y∈Y, h∈H} δ(x, y, h);
and allow lowercase variables to stand in for singleton sets, as in δ(x, Y, h) =
δ({x}, Y, {h}). Now the statement that H is 2-universal becomes ∀x, y :
δ(x, y, H) ≤ |H|/m; this says that only 1/m of the functions in H cause any
particular distinct x and y to collide.
If H includes all functions U → M , we get equality: a random function
gives h(x) = h(y) with probability exactly 1/m. But we might do better if
each h tends to map distinct values to distinct places. The following lemma
shows we can’t do too much better:
Lemma 7.2.1. For any family H, there exist x, y such that
δ(x, y, H) ≥ (|H|/m)(1 − (m − 1)/(|U| − 1)).

Proof. We’ll count collisions in the inverse image of each element z. Since
all distinct pairs of elements of h^{−1}(z) collide with each other, we have

    δ(h^{−1}(z), h^{−1}(z), h) = |h^{−1}(z)| · (|h^{−1}(z)| − 1).

Summing over all z ∈ M gets all collisions, giving

    δ(U, U, h) = \sum_{z∈M} |h^{−1}(z)| · (|h^{−1}(z)| − 1).

Use convexity or Lagrange multipliers to argue that the right-hand side is
minimized subject to \sum_z |h^{−1}(z)| = |U| when all pre-images are the same
size |U|/m. It follows that

    δ(U, U, h) ≥ \sum_{z∈M} (|U|/m)(|U|/m − 1)
               = m · (|U|/m)(|U|/m − 1)
               = (|U|/m)(|U| − m).

If we now sum over all h, we get

    δ(U, U, H) ≥ (|H|/m) · |U|(|U| − m).

There are exactly |U|(|U| − 1) ordered pairs x, y for which δ(x, y, H) might
not be zero; so the Pigeonhole principle says some pair x, y has

    δ(x, y, H) ≥ (|H|/m) · |U|(|U| − m)/(|U|(|U| − 1))
               = (|H|/m)(1 − (m − 1)/(|U| − 1)).

Since 1 − (m − 1)/(|U| − 1) is likely to be very close to 1, we are happy if we get the
2-universal upper bound of |H|/m.
Why we care about this: With a 2-universal hash family, chaining using
linked lists costs O(1 + n/m) expected time per operation. The reason is
that the expected cost of an operation on some key x is proportional to the
size of the linked list at h(x) (plus O(1) for the cost of hashing itself). But
the expected size of this linked list is just the expected number of keys y in
the dictionary that collide with x, which is exactly δ(x, S, H)/|H| ≤ n/m.
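To see where this bound shows up in code, here is a minimal chained hash table. The helper make_hash is a hypothetical stand-in (my name, not the notes’) for a procedure that samples h with range [m] from some 2-universal family, such as the one constructed in the next section.

class ChainedDict:
    def __init__(self, m, make_hash):
        self.h = make_hash(m)                 # one random h chosen up front
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        bucket = self.buckets[self.h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # expected chain length is at most 1 + n/m

    def lookup(self, key):
        for k, v in self.buckets[self.h(key)]:
            if k == key:
                return v
        return None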

7.2.1 Linear congruential hashing


Universal hash families often look suspiciously like classic pseudorandom
number generators. Here is a 2-universal hash family based on taking
remainders. It is assumed that the universe U is a subset of Zp , the integers
mod p; effectively, this just means that every element x of U satisfies 0 ≤
x ≤ p − 1.
Lemma 7.2.2. Let h_{ab}(x) = ((ax + b) mod p) mod m, where a ∈ Z_p \ {0}, b ∈ Z_p,
and p is a prime ≥ m. Then {h_{ab}} is 2-universal.
Proof. Again, we count collisions. Split h_{ab}(x) as g(f_{ab}(x)) where f_{ab}(x) =
(ax + b) mod p and g(x) = x mod m.
The intuition is that if we fix x and y and consider all p(p − 1) pairs a, b
with a ≠ 0, all p(p − 1) distinct pairs of values r = f_{ab}(x) and s = f_{ab}(y)
are equally likely. We then show that feeding these values to g produces no
more collisions than expected.
The formal statement of the intuition is that for any 0 ≤ x, y ≤ p − 1
with x ≠ y, δ(x, y, H) = δ(Z_p, Z_p, g).
To prove this, fix x and y, and consider some pair r ≠ s ∈ Z_p. Then
the equations ax + b = r and ay + b = s have a unique solution for a and
b mod p (because Z_p is a finite field). Furthermore this solution has a ≠ 0,
since otherwise f_{ab}(x) = f_{ab}(y) = b. So the function q(a, b) = ⟨f_{ab}(x), f_{ab}(y)⟩
is a bijection between pairs a, b and pairs r, s. Any collisions will arise
from applying g, giving δ(x, y, H) = \sum_{a,b} δ(x, y, h_{ab}) = \sum_{r≠s} δ(r, s, g) =
δ(Z_p, Z_p, g).
Now we just need to count how many distinct r and s collide. There are
p choices for r. For each r, the possible s that map to the same remainder
mod m are in a set of the form {r_0, r_0 + m, r_0 + 2m, . . . , r_0 + km} where
r_0 = r mod m and r_0 + km ≤ p − 1, which gives k ≤ (p − 1)/m. There are
k + 1 elements of this set, but s ≠ r means we can only use k of them. This
gives at most k ≤ (p − 1)/m choices for s for each of the p choices for r.
Multiplying out these bounds gives δ(x, y, H) = δ(Z_p, Z_p, g) ≤ p(p − 1)/m.
Since each choice of a ≠ 0 and b occurs with probability 1/(p(p − 1)), this gives
a probability of collision of at most 1/m.
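A direct implementation of this family might look like the following sketch; the particular Mersenne prime p = 2^61 − 1 is my choice for convenience, and any prime at least as large as both m and the largest key works.

import random

def make_hash(m, p=2**61 - 1):
    # Sample h_{ab}(x) = ((a*x + b) mod p) mod m with a uniform in Z_p \ {0}, b in Z_p.
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

h = make_hash(10)
print(h(12345), h(67890))   # two values in range(10)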

A difficulty with this hash family is that it requires doing modular


arithmetic. A faster hash is given by Dietzfelbinger et al. [DHKP97], although
it requires a slight weakening of the notion of 2-universality. For each k and
ℓ they define a class H_{k,ℓ} of functions from [2^k] to [2^ℓ] by defining

    h_a(x) = (ax mod 2^k) div 2^{k−ℓ},

where x div y = ⌊x/y⌋. They prove [DHKP97, Lemma 2.4] that if a is a
random odd integer with 0 < a < 2^k, and x ≠ y, then Pr[h_a(x) = h_a(y)] ≤ 2^{−ℓ+1}.
This increases by a factor of 2 the likelihood of a collision, but any extra
costs from this can often be justified in practice by the reduction in costs
from working with powers of 2.
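In code, the appeal is that only a multiplication, a mask, and a shift are needed; the following sketch (mine, for illustration) hashes k-bit integer keys to ℓ-bit values.

import random

def make_multiply_shift(k, l):
    # h_a(x) = (a*x mod 2^k) div 2^(k-l) for a random odd a with 0 < a < 2^k.
    a = random.randrange(1, 1 << k, 2)        # random odd multiplier
    mask = (1 << k) - 1
    return lambda x: ((a * x) & mask) >> (k - l)

h = make_multiply_shift(64, 10)               # 64-bit keys, 10-bit hash values
print(h(123456789))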
If we are willing to use more randomness (and more space), a method
called tabulation hashing (§7.2.2) gives a simpler alternative that is 3-
universal.

7.2.2 Tabulation hashing


Tabulation hashing [CW77] is a method for hashing fixed-length strings
(or things that can be represented as fixed-length strings) into bit-vectors.
The description here follows Patrascu and Thorup [PT12].
Let c be the length of each string in characters, and let s be the size of
the alphabet. Initialize the hash function by constructing tables T1 . . . Tc
mapping characters to independent random bit-vectors of size lg m. Define

h(x) = T_1[x_1] ⊕ T_2[x_2] ⊕ · · · ⊕ T_c[x_c],

where ⊕ represents bitwise exclusive OR (what ^ does in C-like languages).3


This gives a family of hash functions that is 3-wise independent but not
4-wise independent.
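As a quick illustration (my own sketch, not code from the cited papers), tabulation hashing is just one table lookup and one XOR per character:

import random

def make_tabulation_hash(c, s, bits):
    # One table of random bit-vectors per character position.
    tables = [[random.getrandbits(bits) for _ in range(s)] for _ in range(c)]
    def h(x):                                  # x: a sequence of c characters in range(s)
        out = 0
        for i, ch in enumerate(x):
            out ^= tables[i][ch]
        return out
    return h

h = make_tabulation_hash(c=4, s=256, bits=16)  # 4-byte strings, 16-bit hash values
print(h(b"abcd"))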
Like many hash algorithms, tabulation hashing was already in use before
it was formalized in general. Zobrist hashing [Zob70] is a special case of
tabulation hashing used to represent positions in board games like Chess and
Go, where Ti [xi ] gives the contribution of having a piece xi in position i. This
is useful for games whose state changes slowly because of a homomorphic
property of tabulation hashing: replacing x_i by x'_i while leaving all other
x_j unchanged does not require recomputing the entire hash function, since
h(x') = h(x) ⊕ T_i[x_i] ⊕ T_i[x'_i] in this case.
The intuition for why the hash values might be independent is that if we
have a collection of strings, and each string brings in an element of T that
doesn’t appear in the other strings, then that element is independent of the
hash values for the other strings and XORing it with the rest of the hash
value gives a random bit string that is independent of the hash values of the
other strings. In fact, we don’t even need each string to include a unique
3
Letting m be a power of 2 and using exclusive OR is convenient on real computers. If
for some reason we don’t like this approach, the same technique, with essentially the same
analysis, works for arbitrary m if we replace bitwise XOR with addition mod m.

value; it’s enough if we can order the strings so that each string gets a value
that isn’t represented among its predecessors.
More formally, suppose we can order the strings x^1, x^2, . . . , x^n that we
are hashing so that each has a position i_j such that x^j_{i_j} ≠ x^{j'}_{i_j} for any j' < j.
Then we have, for each value v, Pr[h(x^j) = v | h(x^{j'}) = v_{j'}, ∀j' < j] = 1/m.
It follows that the hash values are independent:

    Pr[h(x^1) = v_1, h(x^2) = v_2, . . . , h(x^n) = v_n]
        = \prod_{j=1}^{n} Pr[h(x^j) = v_j | h(x^1) = v_1, . . . , h(x^{j−1}) = v_{j−1}]
        = 1/m^n
        = \prod_{j=1}^{n} Pr[h(x^j) = v_j].

Now we want to show that when n = 3, this actually works for all possible
distinct strings x, y, and z. Let S be the set of indices i such that y_i ≠ x_i,
and similarly let T be the set of indices i such that z_i ≠ x_i; note that both
sets must be non-empty, since y ≠ x and z ≠ x. If S \ T is nonempty, then
(a) there is some index i in T where z_i ≠ x_i, and (b) there is some index j in
S \ T where y_j ≠ x_j = z_j; in this case, ordering the strings as x, z, y gives
the independence property above. If T \ S is nonempty, order them as x, y,
z instead. Alternatively, if S = T, then y_i ≠ z_i for some i in S (otherwise
y = z, since they both equal x on all positions outside S). In this case, x_i,
y_i, and z_i are all distinct.
For n = 4, we can have strings aa, ab, ba, and bb. If we take the
bitwise exclusive OR of all four hash values, we get zero, because each
character is included exactly twice in each position. So the hash values are
not independent, and we do not get 4-independence in general.
However, even though the outputs of tabulation hashing are not 4-
independent, most reasonably small sets of inputs do give independence.
This can be used to show various miraculous properties like working well for
the cuckoo hashing algorithm described in §7.4.

7.3 FKS hashing


The FKS hash table, named for Fredman, Komlós, and Szemerédi [FKS84],
is a method for storing a static set S so that we never pay more than
constant time for search (not just in expectation), while at the same time
not consuming too much space. The assumption that S is static is critical,
because FKS chooses hash functions based on the elements of S.
If we were lucky in our choice of S, we might be able to do this with
standard hashing. A perfect hash function for a set S ⊆ U is a hash
function h : U → M that is injective on S (that is, x 6= y → h(x) 6= h(y)
when x, y ∈ S). Unfortunately, we can only count on finding a perfect hash
function if m is large:
Lemma 7.3.1. If H is 2-universal and |S| = n with \binom{n}{2} ≤ m, then there is
a perfect h ∈ H for S.

Proof. Let h be chosen uniformly at random from H. For each unordered pair
x ≠ y in S, let X_{xy} be the indicator variable for the event that h(x) = h(y),
and let C = \sum_{x≠y} X_{xy} be the total number of collisions in S. Each X_{xy}
has expectation at most 1/m, so E[C] ≤ \binom{n}{2}/m < 1. But we can write
E[C] as E[C | C = 0] Pr[C = 0] + E[C | C ≥ 1] Pr[C ≠ 0] ≥ Pr[C ≠ 0]. So
Pr[C ≠ 0] ≤ \binom{n}{2}/m < 1, giving Pr[C = 0] > 0. But if C is zero with nonzero
probability, there must be some h that makes it 0. That h is perfect for
S.

Essentially the same argument shows that if α^{−1}\binom{n}{2} ≤ m, then
Pr[h is perfect for S] ≥ 1 − α. This can be handy if we want to find a
perfect hash function and not just demonstrate that it exists.
Using a perfect hash function, we get O(1) search time using O(n^2) space.
But we can do better by using perfect hash functions only at the second
level of our data structure, which at top level will just be an ordinary hash
table. This is the idea behind the Fredman-Komlós-Szemerédi (FKS) hash
table [FKS84].
The short version is that we hash to n = |S| bins, then rehash perfectly
within each bin. The top-level hash table stores a pointer to a header for
each bin, which gives the size of the bin and the hash function used within
it. The i-th bin, containing n_i elements, uses O(n_i^2) space to allow perfect
hashing. The total size is O(n) as long as we can show that \sum_{i=1}^{n} n_i^2 = O(n).

The time to do a search is O(1) in the worst case: O(1) for the outer hash
plus O(1) for the inner hash.
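The following sketch (my illustration; it retries random hash functions where the proofs below argue existence via expectations) builds such a two-level table for a static set of integer keys.

import random

def make_hash(m, p=2**61 - 1):
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def perfect_hash(bucket):
    # Retry random functions into len(bucket)^2 slots until one is injective on bucket.
    m = max(1, len(bucket) ** 2)
    while True:
        g = make_hash(m)
        table = [None] * m
        collision = False
        for x in bucket:
            if table[g(x)] is not None:
                collision = True
                break
            table[g(x)] = x
        if not collision:
            return g, table

def build_fks(S):
    n = len(S)
    while True:                               # outer hash: retry until total size stays O(n)
        h = make_hash(n)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 2 * n:
            break
    inner = [perfect_hash(b) for b in buckets]
    def contains(x):
        g, table = inner[h(x)]
        return table[g(x)] == x
    return contains

contains = build_fks([1, 2, 3, 4, 5, 6, 9])
print(contains(4), contains(7))               # True False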

Theorem 7.3.2. The FKS hash table uses O(n) space.

Proof. Suppose we choose h ∈ H as the outer hash function, where H is
some 2-universal family of hash functions. Compute:

    \sum_{i=1}^{n} n_i^2 = \sum_{i=1}^{n} (n_i + n_i(n_i − 1))
                         = n + δ(S, S, h).

The last equality holds because each ordered pair of distinct values in S that
map to the same bucket i corresponds to exactly one collision in δ(S, S, h).
Since H is 2-universal, we have δ(S, S, H) ≤ |H| · |S|(|S| − 1)/n = |H| · n(n − 1)/n =
|H|(n − 1). But then the Pigeonhole principle says there exists some h ∈ H
with δ(S, S, h) ≤ (1/|H|) δ(S, S, H) = n − 1. Choosing this h gives \sum_{i=1}^{n} n_i^2 ≤
n + (n − 1) = 2n − 1 = O(n).

If we want to find a good h quickly, increasing the size of the outer


table to n/α gives us a probability of at least 1 − α of getting a good one,
using essentially the same argument as for perfect hash functions. More
generally, it’s possible to adapt FKS hashing to work with dynamic data sets
by growing the main table and each subtable as needed to keep the 2-lookup
property while maintaining O(1) expected amortized cost per insertion. This
is known as dynamic perfect hashing and was studied by Dietzfelbinger et
al. [DKM+ 94]. In practice the overhead of this approach is higher than other
2-lookup schemes, such as the one described in the next section.

7.4 Cuckoo hashing


Cuckoo hashing [PR04] is a hash table implementation that uses a single
table and guarantees at most 2 probes per lookup, with O(1) expected
amortized cost per insertion.
The name comes from the cuckoo, a bird notorious for stealing space
for its own eggs in other birds’ nests. In cuckoo hashing, newly-inserted
elements may steal slots from other elements, forcing those elements to find
an alternate nest.
The formal mechanism is to use two hash functions h1 and h2 , and store
each element x in one of the two positions h1 (x) or h2 (x). This may require
moving other elements to their alternate locations to make room. But the
payoff is that each search takes only two reads, which can even be done in
parallel. This is optimal by a lower bound of Pagh [Pag01], which also shows
a matching upper bound for static dictionaries using a different technique.
Cuckoo hashing was invented by Pagh and Rodler [PR04]. The version
described here is based on a simplified version from notes of Pagh [Pag06].

The main difference is that it uses just one table instead of the two tables—one
for each hash function—in [PR04].

7.4.1 Structure
We have a table T of size n, with two separate, independent hash functions h1
and h2 . These functions are assumed to be k-universal for some sufficiently
large value k; as long as we never look at more than k values at once, this
means we can treat them effectively as random functions. In practice, using
crummy hash functions seems to work just fine, a common property of hash
tables. There are also specific hash functions that have been shown to work
with particular variants of cuckoo hashing [PR04, PT12]. We will avoid these
issues by assuming that our hash functions are actually random.
Every key x is stored either in T [h1 (x)] or T [h2 (x)]. So the search
procedure just looks at both of these locations and returns whichever one
contains x (or fails if neither contains x).
To insert a value x1 = x, we must put it in T [h1 (x1 )] or T [h2 (x1 )]. If one
or both of these locations is empty, we put it there. Otherwise we have to
kick out some value that is in the way (this is the “cuckoo” part of cuckoo
hashing, named after the bird that leaves its eggs in other birds’ nests). We
do this by letting x2 = T [h1 (x1 )] and writing x1 to T [h1 (x1 )]. We now have
a new “nestless” value x2 , which we swap with whatever is in T [h2 (x2 )]. If
that location was empty, we are done; otherwise, we get a new value x3 that
we have to put in T [h1 (x3 )] and so on. The procedure terminates when we
find an empty spot or if enough iterations have passed that we don’t expect
to find an empty spot, in which case we rehash the entire table. This process
can be implemented succinctly as shown in Algorithm 7.1.
A detail not included in the above code is that we always rehash (in
theory) after m2 insertions; this avoids potential problems with the hash
functions used in the paper not being universal enough. We will avoid this
issue by assuming that our hash functions are actually random (instead of
being approximately n-universal with reasonably high probability). For a
more principled analysis of where the hash functions come from, see [PR04].
An alternative hash family that is known to work for a slightly different
variant of cuckoo hashing is tabulation hashing, as described in §7.2.2; the
proof that this works is found in [PT12].

procedure insert(x):
    if T[h1(x)] = x or T[h2(x)] = x then
        return
    pos ← h1(x)
    for i ← 1 . . . n do
        if T[pos] = ⊥ then
            T[pos] ← x
            return
        x ↔ T[pos]
        if pos = h1(x) then
            pos ← h2(x)
        else
            pos ← h1(x)
    If we got here, rehash the table and reinsert x.

Algorithm 7.1: Insertion procedure for cuckoo hashing. Adapted
from [Pag06]

7.4.2 Analysis
The main question is how long it takes the insertion procedure to terminate,
assuming the table is not too full.
First let’s look at what happens during an insert if we have many nestless
values. We have a sequence of values x1 , x2 , . . . , where each pair of values
xi , xi+1 collides in h1 or h2 . Assuming we don’t reach the loop limit, there
are three main possibilities (the leaves of the tree of cases below):

1. Eventually we reach an empty position without seeing the same key


twice.

2. Eventually we see the same key twice; there is some i and j > i such
that xj = xi . Since xi was already moved once, when we reach it
the second time we will try to move it back, displacing xi−1 . This
process continues until we have restored x2 to T [h1 (x1 )], displacing x1
to T [h2 (x1 )] and possibly creating a new sequence of nestless values.
Two outcomes are now possible:

(a) Some x_ℓ is moved to an empty location. We win!

(b) Some x_ℓ is moved to a location we’ve already looked at. We lose!
    We find we are playing musical chairs with more players than
    chairs, and have to rehash.

Let’s look at the probability that we get the last case, a closed loop.
Following the argument of Pagh and Rodler, we let v be the number of
distinct nestless keys in the loop. Since v includes x1 and at least one other
element blocking x1 from being inserted at T [h1 (x1 )], v is at least 2. We can
now count how many different ways such a loop can form, and argue that in
each case we include enough information to reconstruct h1 (ui ) and h2 (ui )
for each of a specific set of unique elements u1 , . . . uv .
Formally, this means that we are expressing the closed-loop case as a
union of many specific closed loops, and then bounding the probability of
each of these specific closed-loop events by the probability of the event that
h1 and h2 select the right values to make this particular closed loop possible.
Then we apply the union bound.
To describe each of the specific events, we’ll provide this information:

• The v elements u_1, . . . , u_v. Since we can fix u_1 = x_1, this leaves v − 1
  choices from S, giving n^{v−1} possibilities (we are overcounting by
  allowing duplicates, but that’s not a problem for an upper bound).
  We’ll require that the other u_i for i > 1 appear in the list in the same
  order they first appear in the sequence x_1, x_2, . . . .

• The v − 1 locations we are trying to fit these v elements into. There
  are at most m^{v−1} choices for these. Again we order these locations
  by order of first appearance.

• The values of i, j, and ℓ. These allow us to identify which segments of
  the sequence x_1, x_2, . . . correspond to new values u_i and which are old
  values repeated (possibly in reverse order).
  There are at most v choices for each of i and j (because we are still in the
  initial segment with no repeats), and at most 2v choices for ℓ if we
  count carefully (because we land on either the initial no-duplicate
  sequence starting with x_1 or the second no-duplicate sequence starting
  with the second occurrence of x_1).
  All together, these give 2v^3 choices.

• For each i ≠ 1, whether the first occurrence of u_i appears at h_1(u_i)
  or h_2(u_i). This gives 2^{v−1} choices, and allows us to correctly identify
  h_1(u_i) or h_2(u_i) from the value of u_i and its first location, and the other
  hash value for u_i given the next location in the list.4

Multiplying everything out gives at most 2v^3 (2nm)^{v−1} choices of closed
loops with v unique elements. Since each particular loop allows us to
determine both h_1 and h_2 for all v of its elements, the probability that we
get exactly these hash values (so that the loop occurs) is m^{−2v}. Summing
over all closed loops with v elements gives a total probability of

    2v^3 (2nm)^{v−1} · m^{−2v} = 2v^3 (2n)^{v−1} m^{−v−1}
                               = 2v^3 (2n/m)^{v−1} m^{−2}.

Now sum over all v ≥ 2. We get

    m^{−2} \sum_{v=2}^{n} 2v^3 (2n/m)^{v−1} < m^{−2} \sum_{v=2}^{∞} 2v^3 (2n/m)^{v−1}.

The series converges if 2n/m < 1, so for any fixed α < 1/2, the probability
of any closed loop forming is O(m^{−2}).
If we do hit a closed loop, then we pay O(m) time to scan the existing
table and create a new empty table, and O(n) = O(m) time on average to
reinsert all the elements into the new table, assuming that this reinsertion
process doesn’t generate any more closed loops and that the average cost of an
insertion that doesn’t produce a closed loop is O(1), which we will show below.
But the rehashing step only fails with probability O(nm^{−2}) = O(m^{−1}), so if
it does fail we can just try again until it works, and the expected total cost
is still O(m). Since we pay this O(m) for each insertion with probability
O(m^{−2}), this adds only O(m^{−1}) to the expected cost of a single insertion.
Now we look at what happens if we don’t get a closed loop. This doesn’t
force us to rehash, but if the path is long enough, we may still pay a lot to
do an insertion.
It’s a little messy to analyze the behavior of keys that appear more than
once in the sequence, so the trick used in the paper is to observe that for any
sequence of nestless keys x1 . . . xp , there is a subsequence of size p/3 with no
repetitions that starts with x1 . This will be either the sequence S1 given by
x1 . . . xj−1 —the sequence starting with the first place we try to insert x1 –or
S2 given by x1 = xi+j−1 . . . xp , the sequence starting with the second place
we try to insert x1 . Between these we have a third sequence R where we
4
The original analysis in [PR04] avoids this by alternating between two tables, so that
we can determine which of h1 or h2 is used at each step by parity.

revert some of the moves made in S1 . Because |S1 | + |R| + |S2 | ≥ p, at least
one of these three subsequences has size p/3. But |R| ≤ |S1 |, so it must be
either S1 or S2 .
We can then argue that the probability that we get a sequence of v distinct
keys in either S_1 or S_2 is at most 2(n/m)^{v−1}. The (n/m)^{v−1} is because we
need to hit a nonempty spot (which happens with probability at most n/m)
for the first v − 1 elements in the path, and since we assume that our hash
functions are random, the choices of these v − 1 spots are all independent.
The 2 is from the union bound over S_1 and S_2. If T is the length of the longer
of S_1 or S_2, we get E[T] = \sum_{v=1}^{∞} Pr[T ≥ v] ≤ \sum_{v=1}^{∞} 2(n/m)^{v−1} = O(1),
assuming n/m is bounded by a constant less than 1. Since we already need
n/m ≤ 1/2 to avoid the bad closed-loop case, we can use this here as well.
We have to multiply E [T ] by 3 to get the bound on the actual path, but this
disappears into O(1).
An annoyance with cuckoo hashing is that it has high space overhead
compared to more traditional hash tables: in order for the first part of the
analysis above to work, the table must be at least half empty. This can be
avoided at the cost of increasing the time complexity by choosing between
d locations instead of 2. This technique, due to Fotakis et al. [FPSS03], is
known as d-ary cuckoo hashing. For a suitable choice of d, it uses (1 + ε)n
space and guarantees that a lookup takes O(1/ε) probes, while insertion
takes (1/ε)^{O(log log(1/ε))} steps in theory and appears to take O(1/ε) steps in
practice according to experiments done by the authors.

7.5 Practical issues


Most hash functions used in practice do not have very good theoretical
guarantees, and indeed we have assumed in several places in this chapter that
we are using genuinely random hash functions when we would expect our
actual hash functions to be at most 2-universal. There is some justification
for doing this if there is enough entropy in the set of keys S. A proof of
this for many common applications of hash functions is given by Chung et
al. [CMV13].
Even taking into account these results, hash tables that depend on
strong properties of the hash function may behave badly if the user supplies a
crummy hash function. For this reason, many library implementations of hash
tables are written defensively, using algorithms that respond better in bad
cases. See https://svn.python.org/projects/python/trunk/Objects/
dictobject.c for an example of a widely-used hash table implementation
chosen specifically because of its poor theoretical characteristics.


For large hash tables, local probing schemes are often faster than chaining
or cuckoo hashing, because it is likely that all of the locations probed to find
a particular value will be on the same virtual memory page. This means
that a search for a new value usually requires one cache miss instead of
two. Hopscotch hashing [HST08] combines ideas from linear probing and
cuckoo hashing to get better performance than both in practice.

7.6 Bloom filters


See [MU17, §5.5.3] for basics and a formal analysis or http://en.wikipedia.
org/wiki/Bloom_filter for many variations and the collective wisdom of
the unwashed masses. The presentation here mostly follows [MU17].

7.6.1 Construction
Bloom filters are a highly space-efficient randomized data structure invented
by Burton H. Bloom [Blo70] for storing sets of keys, with a small probability
for each key not in the set that it will be erroneously reported as being in
the set.
Suppose we have k independent hash functions h1 , h2 , . . . , hk . Our mem-
ory store A is a vector of m bits, all initially zero. To store a key x, set
A[hi (x)] = 1 for all i. To test membership for x, see if A[hi (x)] = 1 for all
i. The membership test always gives the right answer if x is in fact in the
Bloom filter. If not, we might decide that x is in the Bloom filter anyway,
just because we got lucky.
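Here is a minimal sketch of the construction, with the k hash functions drawn from the 2-universal family of §7.2.1 as a stand-in for the idealized random functions assumed in the analysis below (my illustration, not the notes’ code):

import random

class BloomFilter:
    def __init__(self, m, k, p=2**61 - 1):
        self.bits = [0] * m
        self.hashes = []
        for _ in range(k):
            a, b = random.randrange(1, p), random.randrange(p)
            # default arguments freeze a, b for each of the k functions
            self.hashes.append(lambda x, a=a, b=b: ((a * x + b) % p) % m)

    def add(self, x):
        for h in self.hashes:
            self.bits[h(x)] = 1

    def __contains__(self, x):
        return all(self.bits[h(x)] == 1 for h in self.hashes)

f = BloomFilter(m=10000, k=7)
for x in range(1000):
    f.add(x)
print(500 in f, 123456789 in f)   # True, and False with high probability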

7.6.2 False positives


The probability of such false positives can be computed in two steps: first,
we estimate how many of the bits in the Bloom filter are set after inserting
n values, and then we use this estimate to compute a probability that any
fixed x shows up when it shouldn’t.
If the h_i are close to being independent random functions,5 then with
n entries in the filter we have Pr[A[i] = 1] = 1 − (1 − 1/m)^{kn}, since each of
5
We are going to sidestep the rather deep swamp of how plausible this assumption
is and what assumption we should be making instead. However, it is known [KM08]
that starting with two sufficiently random-looking hash functions h and h′ and setting
h_i(x) = h(x) + i·h′(x) works.

the kn bits that we set while inserting the n values has one chance in m of
hitting position i.
We’d like to simplify this using the inequality 1 + x ≤ e^x, but it goes
in the wrong direction; instead, we’ll use 1 − x ≥ e^{−x−x^2}, which holds for
0 ≤ x ≤ 0.683803 and in our application holds for m ≥ 2. This gives

    Pr[A[i] = 1] ≤ 1 − (1 − 1/m)^{kn}
                 ≤ 1 − e^{−kn(1/m)(1+1/m)}
                 = 1 − e^{−kα(1+1/m)}
                 = 1 − e^{−kα′},

where α = n/m is the load factor and α′ = α(1 + 1/m) is the load factor
fudged upward by a factor of 1 + 1/m to make the inequality work.
Suppose now that we check to see if some value x that we never inserted
in the Bloom filter appears to be present anyway. This occurs if A[hi (x)] = 1
for all i. Since each A[hi (x)] is an independent sample from A, the probability
that they all come up 1 conditioned on A is
    ((\sum_i A[i])/m)^k.                                    (7.6.1)

We have an upper bound E[\sum_i A[i]] ≤ m(1 − e^{−kα′}), and if we were born
luckier, we might be able to get an upper bound on the expectation of (7.6.1)
by applying Jensen’s inequality to the function f(x) = x^k. But sadly this
inequality also goes in the wrong direction, because f is convex for k > 1.
So instead we will prove a concentration bound on S = \sum_i A[i].
Because the A[i] are not independent, we can’t use off-the-shelf Chernoff
bounds. Instead, we rely on McDiarmid’s inequality. Our assumption is
that the locations of the kn ones that get written to A are independent.
Furthermore, changing the location of one of these writes changes S by at most
2
1. So McDiarmid’s inequality (5.3.13)
q gives Pr [S ≥ E [S] + t] ≤ e−2t /kn ,
which is bounded by n−c for t ≥ 12 ckn log n. So as long as a reasonably
large fraction of the array is likely to be full, the relative error from assuming
S = E [S] is likely to be small. Alternatively, if the array is mostly empty,
then we don’t care about the relative error so much because the probability
of getting a false positive will already be exponentially small as a function of
k.
So let’s assume for simplicity that our false positive probability is exactly
(1 − e^{−kα′})^k. We can choose k to minimize this quantity for fixed α′ by
doing the usual trick of taking a derivative and setting it to zero; to avoid
weirdness with the k in the exponent, it helps to take the logarithm first
(which doesn’t affect the location of the minimum), and it further helps to
take the derivative with respect to x = e^{−α′k} instead of k itself. Note that
when we do this, k = −(1/α′) ln x still depends on x, and we will deal with this
by applying this substitution at an appropriate point.
Compute

    d/dx ln((1 − x)^k) = d/dx (k ln(1 − x))
                       = d/dx (−(1/α′) ln x · ln(1 − x))
                       = −(1/α′) (ln(1 − x)/x − ln x/(1 − x)).

Setting this to zero gives (1 − x) ln(1 − x) = x ln x, which by symmetry
has the unique solution x = 1/2, giving k = (1/α′) ln 2.
In other words, to minimize the false positive rate for a known load factor
α, we want to choose k = (1/α′) ln 2 = (1/(α(1 + 1/m))) ln 2, which makes each bit one
with probability approximately 1 − e^{−ln 2} = 1/2. This makes intuitive sense,
since having each bit be one or zero with equal probability maximizes the
entropy of the data.
The probability of a false positive for a given key is then 2^{−k} = 2^{−ln 2/α′}.
For a given maximum false positive rate ε, and assuming optimal choice of
k, we need to keep α′ ≤ ln^2 2 / ln(1/ε), or α ≤ ln^2 2 / ((1 + 1/m) ln(1/ε)).
Another way to look at this is that if we fix ε and n, we need m/(1 +
1/m) ≥ n · ln(1/ε)/ln^2 2 ≈ 1.442695 · n lg(1/ε), which works out to m ≥ 1.442695 ·
n lg(1/ε) + O(1). This is very good for constant ε.
Note that for this choice of m, we have α = O(1/ln(1/ε)), giving k =
O(log(1/ε)). So for polynomial ε, we get k = O(log n). This is closer to the
complexity of tree lookups than hash table lookups, so the main payoff for a
sequential implementation is that we don’t have to store full keys.
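As a small worked example (my own illustration, ignoring the 1 + 1/m correction), the formulas above translate directly into a parameter calculator:

import math

def bloom_parameters(n, eps):
    # m ≈ n lg(1/eps)/ln 2 bits and k ≈ lg(1/eps) hash functions.
    m = math.ceil(n * math.log2(1 / eps) / math.log(2))
    k = max(1, round(math.log2(1 / eps)))
    return m, k

print(bloom_parameters(10**6, 0.01))   # about 9.6 million bits and k = 7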

7.6.3 Comparison to optimal space


If we wanted to design a Bloom-filter-like data structure from scratch and
had no constraints on processing power, we’d be looking for something that
stored an index of size lg M into a family of subsets S_1, S_2, . . . , S_M of our
universe of keys U, where |S_i| ≤ ε|U| for each i (giving the upper bound on
the false positive rate)6 and for any set A ⊆ U of size n, A ⊆ S_i for at least
one S_i (allowing us to store A).
Let N = |U|. Then each set S_i covers \binom{εN}{n} of the \binom{N}{n} subsets of size n. If
we could get them to overlap optimally (we can’t), we’d still need a minimum
of \binom{N}{n}/\binom{εN}{n} = (N)_n/(εN)_n ≈ (1/ε)^n sets to cover everybody, where the
approximation assumes N ≫ n. Taking the log gives lg M ≈ n lg(1/ε),
meaning we need about lg(1/ε) bits per key for the data structure. Bloom
filters use 1/ln 2 times this.
filters use 1/ ln 2 times this.
There are known data structures that approach this bound asymptotically.
The first of these, due to Pagh et al. [PPR05] also has other desirable
properties, like supporting deletions and faster lookups if we can’t look up
bits in parallel.
More recently, Fan et al. [FAKM14] have described a variant of cuckoo
hashing (see §7.4) called a cuckoo filter. This is a cuckoo hash table that,
instead of storing full keys x, stores fingerprints f(x), where f is a hash
function with ℓ-bit outputs. False positives now arise if we happen to hash a
value x′ with f(x′) = f(x) to the same location as x. If f is drawn from a
2-universal family, this occurs with probability at most 2^{−ℓ}. So the idea is
that by accepting a small rate ε of false positives, we can shrink the space
needed to store each key from the full key length to lg(1/ε) = ln(1/ε)/ln 2,
the asymptotic minimum.
One complication is that, since we are throwing away the original key x,
when we displace a key from h1 (x) to h2 (x) or vice versa, we can’t recompute
h1 (x) and h2 (x) for arbitrary h1 and h2 . The solution proposed by Fan et al.
is to let h2 (x) = h1 (x) ⊕ g(f (x)), where g is a hash function that depends
only on the fingerprint. This means that when looking at a fingerprint f (x)
stored in position i, we don’t need to know whether i is h1 (x) or h2 (x), since
whichever it is, the other location will be i ⊕ g(f (x)). Unfortunately, this
technique and some other techniques used in the paper to crunch out excess
empty space break the standard analysis of cuckoo hashing, so the authors
can only point to experimental evidence that their data structure actually
6
Technically, this gives a weaker bound on false positives. For standard Bloom filters,
assuming random hash functions, each key individually has at most an ε probability
of appearing as a false positive. The hypothetical data structure we are considering
here—which is effectively deterministic—allows the set of false positives to depend directly
on the set of keys actually inserted in the data structure, meaning that the adversary
could arrange for a specific key to appear as a false positive with probability 1 by choosing
appropriate keys to insert. So this argument may underestimate the space needed to get
make the false positives less predictable. On the other hand, we aren’t charging the Bloom
filter for the space needed to store the hash functions, which could be quite a bit if they
are genuine random functions.

works. However, a variant of this data structure has been shown to work by
Eppstein [Epp16].

7.6.4 Applications
Historically, Bloom filters were invented to act as a way of filtering queries
to a database table through fast but expensive7 RAM before looking up the
actual values on a slow but cheap tape drive. Nowadays the cost of RAM is
low enough that this is less of an issue in most cases, but Bloom filters are
still popular in networking and in distributed databases.
In networking, Bloom filters are useful in building network switches, where
incoming packets need to be matched against routing tables in fractions of a
nanosecond. Bloom filters work particularly well for this when implemented
in hardware, since the k hash functions can be computed in parallel. False
positives, if infrequent enough, can be handled by some slower backup
mechanism.
In distributed databases, Bloom filters are used in the Bloomjoin algo-
rithm [ML86]. Here we want to do a join on two tables stored on different
machines (a join is an operation where we find all pairs of rows, one in each
table, that match on some common key). A straightforward but expensive
way to do this is to send the list of keys from the smaller table across the
network, then match them against the corresponding keys from the larger
table. If there are ns rows in the smaller table, nb rows in the larger table,
and j matching rows in the larger table, this requires sending ns keys plus j
rows. If instead we send a Bloom filter representing the set of keys in the
smaller table, we only need to send n_s lg(1/ε)/ln 2 bits for the Bloom filter
plus an extra εn_b rows on average for the false positives. This can be cheaper
than sending full keys across if the number of false positives is reasonably
small.
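A toy version of the protocol, with an ordinary Bloom filter and in-memory lists standing in for the two remote tables; the filter parameters, table layout, and helper names here are illustrative choices, not taken from [ML86].

import hashlib

class BloomFilter:
    """Plain Bloom filter used only to illustrate the Bloomjoin idea."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

def bloomjoin(small_table, big_table, m=1 << 16, k=7):
    """small_table and big_table are lists of (key, row) pairs held on
    different machines.  Site A sends only a Bloom filter of its keys;
    site B sends back the rows whose keys pass the filter (including some
    false positives); site A then finishes the join exactly."""
    f = BloomFilter(m, k)
    for key, _ in small_table:
        f.add(key)                                   # message 1: the filter
    candidates = [(key, row) for key, row in big_table if key in f]
    small_by_key = {}                                # message 2: candidate rows
    for key, row in small_table:
        small_by_key.setdefault(key, []).append(row)
    return [(key, srow, brow)
            for key, brow in candidates
            for srow in small_by_key.get(key, [])]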

7.6.5 Counting Bloom filters


It’s not hard to modify a Bloom filter to support deletion. The basic trick is
to replace each bit with a counter, so that whenever a value x is inserted, we
increment A[hi (x)] for all i and when it is deleted, we decrement the same
locations. The search procedure now returns min_i A[h_i(x)] (which means that in principle it can even report back multiplicities, though with some
probability of reporting a value that is too high). To avoid too much space
overhead, each array location is capped at some small maximum value c; once it reaches this value, further increments have no effect. The resulting
structure is called a counting Bloom filter, due to Fan et al. [FCAB00].
We can only expect this to work if our chance of hitting the cap is small.
Fan et al. observe that the probability that the m table entries include one
that is at least c after n insertions is bounded by
m · C(nk, c) · (1/m)^c ≤ m · (enk/c)^c · (1/m)^c
                       = m · (enk/(cm))^c
                       = m(ekα/c)^c.

(This uses the bound C(n, k) ≤ (en/k)^k, which follows from Stirling's formula.) For k = (1/α) ln 2, this is m(e ln 2/c)^c. For the specific value of c = 16 (corresponding to 4 bits per entry), they compute a bound of 1.37 × 10^{−15} m,
which they argue is minuscule for all reasonable values of m (it’s a systems
paper).
The possibility that a long chain of alternating insertions and deletions
might produce a false negative due to overflow is considered in the paper,
but the authors state that “the probability of such a chain of events is so
low that it is much more likely that the proxy server would be rebooted in
the meantime and the entire structure reconstructed.” An alternative way of
dealing with this problem is to never decrement a maxed-out register. This
never produces a false negative, but may cause the filter to slowly fill up
with maxed-out registers, producing a higher false-positive rate.
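The following is a minimal sketch of a counting Bloom filter with saturating counters. The SHA-256-based hash functions are stand-ins for the 2-universal families assumed in the analysis, and the delete method implements the never-decrement-a-maxed-out-register variant just described.

import hashlib

class CountingBloomFilter:
    """Each bit of a standard Bloom filter becomes a small saturating
    counter, so deletions become possible."""

    def __init__(self, m, k, cap=15):
        self.m, self.k, self.cap = m, k, cap
        self.counts = [0] * m

    def _positions(self, x):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, x):
        for p in self._positions(x):
            if self.counts[p] < self.cap:      # counters saturate at the cap
                self.counts[p] += 1

    def delete(self, x):
        for p in self._positions(x):
            if 0 < self.counts[p] < self.cap:
                # skipping maxed-out counters avoids false negatives at the
                # cost of a slowly rising false-positive rate
                self.counts[p] -= 1

    def count(self, x):
        # overestimate of the multiplicity of x (0 means "definitely absent")
        return min(self.counts[p] for p in self._positions(x))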
A fancier variant of this idea is the spectral Bloom filter of Cohen
and Matias [CM03], which uses larger counters to track multiplicities of
items. The essential idea here is that we can guess that the number of times
a particular value x was inserted is equal to min_{i=1..k} A[h_i(x)], with some
extra tinkering to detect errors based on deviations from the typical joint
distribution of the A[hi (x)] values. An even more sophisticated approach
gives the count-min sketches of the next section.

7.7 Data stream computation


In the data stream model, we are given a huge flood of data—far too big
to store—in a single pass, and want to incrementally build a small data
structure, called a sketch, that will allow us to answer statistical questions
about the data after we’ve processed it all. The motivation is the existence

of data sets that are too large to store at all (network traffic statistics), or
too large to store in fast memory (very large database tables). By building
an appropriate small data structure using a single pass through the data,
we can still answer queries about the data with some loss of accuracy. Examples we
will consider include estimating the size of a set presented over time with
possible duplicate elements (§7.7.1) or more general statistical queries based
on aggregate counts of some sort (§7.7.2).
In each of these cases, the answers we get will be approximate. We
will measure the quality of the approximation in terms of parameters (δ, ε), where we demand a relative error of at most ε with probability at least 1 − δ. We'd also like our data structure to have size at most polylogarithmic in the number of samples n and polynomial in δ and ε.

7.7.1 Cardinality estimation


The cardinality estimation or count-duplicates problem involves seeing
a sequence of values x1 , x2 , . . . , xn and asking to compute the number of
unique values in this sequence.
Without the uniqueness constraint, this is trivial: just keep a counter.
With the uniqueness constraint, exact counting is much harder, since any
data structure that lets us detect if we see a new element also lets us test for
membership. But if we are willing to accept an approximation, we can get
around this by using hashing and then tracking statistical properties of the
hashed values. To keep the analysis simple, we will assume that our hashing
function is a random function, and not charge for storing its parameters.
Many algorithms of this type are based on a tool called a Flajolet-
Martin sketch [FNM85]. The simplest version of this is that we pick
a random hash function h, and use it to generate a geometric random
variable Ri for each xi by counting the number of trailing zeroes in the
binary representation of h(xi ). We then track R = maxi Ri and estimate the
number n of unique xi using 2R .
The intuition for why this counts unique xi is that sending in the same xi twice produces the same hash value h(xi), and thus the same Ri, both times, so the maximum R is unaffected.
It’s easy to show that this gives a reasonably good approximation to
n. For the upper bound, given n samples, the expected number of samples with Ri ≥ k is n · 2^{-k}, so for k = lg n + ℓ, the probability that 2^R ≥ 2^k is at most 2^{-ℓ} by Markov's inequality. For the lower bound, we have Pr[Ri < k] = 1 − 2^{-k}, so Pr[∀i : Ri < k] = (1 − 2^{-k})^n ≤ e^{-n·2^{-k}}, which gives Pr[max_i Ri < lg n − ℓ] ≤ e^{-2^ℓ}. The lower bound is a bit stronger than the

upper bound, but in both directions we get at least a constant probability of


being within a (large) constant factor of the correct answer. The expected
value can also be shown to converge to the correct value for large enough
n, after multiplying by a small correction factor φ that compensates for
systematic round-off error caused by quantizing to powers of 2.
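A minimal sketch of the basic estimator, using a keyed version of Python's hash as a stand-in for the random hash function assumed in the analysis:

import random

def trailing_zeros(v):
    """Number of trailing zero bits in v (v > 0)."""
    return (v & -v).bit_length() - 1

class FlajoletMartin:
    """Hash each item, track the maximum number of trailing zeros R seen,
    and report 2^R as the estimate of the number of distinct items."""

    def __init__(self, seed=None):
        self.key = random.Random(seed).getrandbits(64)
        self.R = 0

    def add(self, x):
        h = hash((self.key, x)) & ((1 << 64) - 1)
        if h:   # h == 0 (64 trailing zeros) is vanishingly rare; ignore it
            self.R = max(self.R, trailing_zeros(h))

    def estimate(self):
        return 2 ** self.R

# Duplicates do not change the estimate, since they hash to the same value:
fm = FlajoletMartin(seed=1)
for x in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]:
    fm.add(x)
print(fm.estimate())   # rough estimate of the 7 distinct values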
To improve the constants, Flajolet and Martin proposed a technique they
called Probabilistic Counting with Stochastic Averaging or PCSA.
This splits the incoming stream into m buckets using a second hash function
(or, in practice, the leading bits of the first one). Each bucket gives its own estimate n̂i = mφ2^{Ri}, and these estimates are averaged to produce the final estimate. Flajolet and Martin show that, with an appropriate multiplier, this estimate has a typical error of O(1/√m). This analysis is pretty involved so we will not repeat it here.
The original PCSA algorithm is not used much in practice. More popular
is a descendant called HyperLogLog [FFGM07] that replaces the arithmetic mean (1/m) Σ_i n̂i with a harmonic mean

1 / ((1/m) Σ_i (1/n̂i)).

As with PCSA, HyperLogLog requires using some carefully calculated


corrections to get an unbiased estimator. This can be avoided using an
auxiliary counter that is updated whenever the main data structure changes,
which also gives some improvement in the accuracy of the estimate. This
mechanism was originally proposed independently by Cohen [Coh15] and
Ting [Tin14], although the description we give here is largely drawn from a
more recent paper by Pettie et al. [PWY20], which refers to this approach
as the martingale transformation of the original data structure.
What this transformation does is observe that for each state s of the
HyperLogLog (or whatever) data structure, there is a probability ps that
the next unique element will send it to a new state s0 . If we can calculate
this probability, then we can update our estimated count λ̂ by increasing it
by 1/ps when the change occurs. Because each new unique element gives an
expected increase of exactly ps · (1/ps) = 1, this makes the expected value in the
counter exactly equal to the actual count. (The connection between this and
martingales is that we just showed that λ̂t − λt is a martingale where λ̂t is
the value of the counter after t steps and λt is the actual number of unique
elements.)
The only tricky part here is computing ps. For HyperLogLog, there is a 1/m chance that our new unique element lands in each of the m buckets, and if it lands in a bucket that currently stores ri, the probability that we increase ri is exactly 2^{-ri-1}. This immediately gives

ps = (1/m) Σ_i 2^{-ri-1}

and thus

1/ps = 1 / ((1/m) Σ_i 2^{-ri-1}),

which looks suspiciously like the harmonic mean used on the final estimates in HyperLogLog. As with the original HyperLogLog, it is possible to show that the typical relative error for this sketch is O(1/√m). See [PWY20] for more details and some further improvements.
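Here is a small sketch of the martingale transformation applied to a HyperLogLog-style table of registers. The hashing details (a keyed Python hash, bucket choice by modulus) are simplified stand-ins, but the counter update follows the 1/ps rule described above: whenever a register actually changes, the estimate is credited by the reciprocal of the probability that a fresh unique element would have changed some register.

import random

class MartingaleHLL:
    def __init__(self, m, seed=0):
        self.m = m
        self.key = random.Random(seed).getrandbits(64)
        self.r = [0] * m          # register i holds the max trailing-zero count seen
        self.estimate = 0.0       # the martingale counter

    def _p_change(self):
        # probability that a new unique element would increase some register
        return sum(2.0 ** (-ri - 1) for ri in self.r) / self.m

    def add(self, x):
        h = hash((self.key, x)) & ((1 << 64) - 1)
        bucket = h % self.m
        rest = h // self.m        # remaining bits, used for the trailing-zero count
        tz = (rest & -rest).bit_length() - 1 if rest else 60
        if tz > self.r[bucket]:
            # credit 1/ps *before* updating the register that changed
            self.estimate += 1.0 / self._p_change()
            self.r[bucket] = tz

Duplicates never change a register, so they never change the estimate; each new unique element increases the estimate by 1 in expectation, which is exactly the unbiasedness argument above.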
If we don’t care about practical engineering issues, there is a known asymp-
totically optimal solution to the cardinality estimation problem [KNW10],
which doesn’t even require assuming a random oracle, but the constants give
worse performance than the systems that people actually use.

7.7.2 Count-min sketches


A count-min sketch is used for the case where we are presented with a
sequence of pairs (it , ct ) where 1 ≤ it ≤ n is an index and ct is a count, and
we want to construct a sketch that will allow us to approximately answer statistical queries about the vector a given by ai = Σ_{t : it = i} ct. These were
developed by Cormode and Muthukrishnan [CM05], although some of the
presentation here is based on [MU17, §15.4]. Structurally, they are related
to counting Bloom filters (§7.6.5).
Note that we are no longer interested in detecting unique values, but
instead want to avoid the cost of storing the entire vector a when most of
its components are small or zero. The goal is that the size of the sketch
should be polylogarithmic in the size of a and the length of the stream, and
polynomial in the error bounds. We also want updating the sketch for each
new data point to be cheap.
The Cormode-Muthukrishnan count-min sketch is fairly versatile, giving
approximations of ai, Σ_{i=ℓ}^{r} ai, and a · b (for any fixed b), and it can also be

used for more complex tasks like finding heavy hitters—indices with high
weight. The easiest case is approximating ai when all the ct are non-negative,
so we’ll start with that.

7.7.2.1 Initialization and updates


To construct a count-min sketch, build a two-dimensional array c with depth
d = ⌈ln(1/δ)⌉ and width w = ⌈e/ε⌉, where ε is the error bound and δ is the probability of exceeding the error bound. Choose d independent hash functions from some 2-universal hash family; we'll use one of these hash functions for each row of the array. Initialize c to all zeros.
The update rule: Given an update (it , ct ), increment c[j, hj (it )] by ct for
j = 1 . . . d. (This is the count part of count-min.)
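A minimal sketch of the data structure, using hash functions of the form ((ax + b) mod p) mod w drawn from a 2-universal family; the point query included at the end is the one analyzed in the next subsection.

import math
import random

class CountMinSketch:
    P = (1 << 61) - 1   # a Mersenne prime larger than any index we expect

    def __init__(self, eps, delta, seed=None):
        rng = random.Random(seed)
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.rows = [[0] * self.w for _ in range(self.d)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(self.d)]

    def _h(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, c):
        # add c (assumed non-negative) to a_i
        for j in range(self.d):
            self.rows[j][self._h(j, i)] += c

    def point_query(self, i):
        # estimate of a_i: never underestimates, and overestimates by more
        # than eps * ||a||_1 with probability at most delta
        return min(self.rows[j][self._h(j, i)] for j in range(self.d))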

7.7.2.2 Queries
Let’s start with point queries. Here we want to estimate ai for some
fixed i. There are two cases; the first handles non-negative increments only,
while the second handles arbitrary increments. In both cases we will get an
estimate whose error is linear in both the error parameter ε and the ℓ1-norm ‖a‖_1 = Σ_i |ai| of a. It follows that the relative error will be low for heavy

points, but we may get a large relative error for light points (and especially
large for points that don’t appear in the data set at all).
For the non-negative case, to estimate ai, compute âi = min_j c[j, hj(i)]. (This is the min part of count-min.) Then:

Lemma 7.7.1. When all ct are non-negative, for âi as defined above:

âi ≥ ai , (7.7.1)

and

Pr[âi ≤ ai + ε‖a‖_1] ≥ 1 − δ.        (7.7.2)

Proof. The lower bound is easy. Since for each pair (i, ct ) we increment each
c[j, hj (i)] by ct , we have an invariant that ai ≤ c[j, hj (i)] for all j throughout
the computation, which gives ai ≤ âi = minj c[j, hj (i)].
For the upper bound, let Iijk be the indicator for the event that (i ≠ k) ∧ (hj(i) = hj(k)), i.e., that we get a collision between i and k using hj. The 2-universality property of the hj gives E[Iijk] ≤ 1/w ≤ ε/e.
Now let Xij = Σ_{k=1}^n Iijk·ak. Then c[j, hj(i)] = ai + Xij. (The fact that Xij ≥ 0 gives an alternate proof of the lower bound.) Now use linearity of

expectation to get

E[Xij] = E[Σ_{k=1}^n Iijk·ak]
       = Σ_{k=1}^n ak·E[Iijk]
       ≤ Σ_{k=1}^n ak·(ε/e)
       = (ε/e)‖a‖_1.

So Pr[c[j, hj(i)] > ai + ε‖a‖_1] = Pr[Xij > e·E[Xij]] < 1/e, by Markov's inequality. With d choices for j, and each hj chosen independently, the probability that every count is too big is at most (1/e)^d = e^{−d} ≤ exp(−ln(1/δ)) = δ.

Now let’s consider the general case, where the increments ct might be
negative. We still initialize and update the data structure as described in
§7.7.2.1, but now when computing âi, we use the median count instead of the minimum count: âi = median{c[j, hj(i)] | j = 1 . . . d}. Now we get:
Lemma 7.7.2. For âi as defined above,

Pr[ai − 3ε‖a‖_1 ≤ âi ≤ ai + 3ε‖a‖_1] > 1 − δ^{1/4}.        (7.7.3)

Proof. The basic idea is that for the median to be off by t, at least d/2
rows must give values that are off by t. We'll show that for t = 3ε‖a‖_1, the
expected number of rows that are off by t is at most d/8. Since the hash
functions for the rows are chosen independently, we can use Chernoff bounds
to show that with a mean of d/8, the chances of getting all the way to d/2
are small.
In detail, we again define the error term Xij as above, and observe that

E[|Xij|] = E[|Σ_k Iijk·ak|]
         ≤ Σ_{k=1}^n |ak|·E[Iijk]
         ≤ Σ_{k=1}^n |ak|·(ε/e)
         = (ε/e)‖a‖_1.

Using Markov's inequality, we get Pr[|Xij| > 3ε‖a‖_1] = Pr[|Xij| > 3e·E[|Xij|]] < 1/(3e) < 1/8. In order for the median to be off by more than 3ε‖a‖_1, we need d/2 of these low-probability events to occur. The expected number that occur is µ = d/8, so applying the standard Chernoff bound (5.2.1) with δ = 3 we are looking at

Pr[S ≥ d/2] = Pr[S ≥ (1 + 3)µ]
            ≤ (e^3/4^4)^{d/8}
            ≤ (e^{3/8}/2)^{ln(1/δ)}
            = δ^{ln 2 − 3/8}
            < δ^{1/4}.

(The actual exponent is about 0.31, but 1/4 is easier to deal with.) This immediately gives (7.7.3).

One way to think about this is that getting an estimate within ε‖a‖_1 of
the right value with probability at least 1 − δ requires 3 times the width and
4 times the depth—or 12 times the space and 4 times the time—when we
aren’t assuming increments are non-negative.
Next, we consider inner products. Here we want to estimate a · b, where
a and b are both stored as count-min sketches using the same hash functions.
The paper concentrates on the case where a and b are both non-negative,
which has applications in estimating the size of a join in a database. The
method is to estimate a · b as min_j Σ_{k=1}^w ca[j, k] · cb[j, k].
For a single j, the sum consists of both good values and bad collisions; we have Σ_{k=1}^w ca[j, k] · cb[j, k] = Σ_{i=1}^n ai·bi + Σ_{p≠q, hj(p)=hj(q)} ap·bq. The second term has expectation

Σ_{p≠q} Pr[hj(p) = hj(q)]·ap·bq ≤ Σ_{p≠q} (ε/e)·ap·bq
                                ≤ Σ_{p,q} (ε/e)·ap·bq
                                ≤ (ε/e)‖a‖_1‖b‖_1.

As in the point-query case, we get probability at most 1/e that a single j gives a value that is too high by more than ε‖a‖_1‖b‖_1, so the probability that the minimum value is too high is at most e^{−d} ≤ δ.

7.7.2.3 Finding heavy hitters


Here we want to find the heaviest elements in the set: those indices i for
which ai exceeds φ‖a‖_1 for some constant threshold 0 < φ ≤ 1. We assume that all ct are non-negative. Because ‖a‖_1 = Σ_i ai, we know that there will

be at most 1/φ heavy hitters. But the tricky part is figuring out which
elements they are.
The output at any stage will be approximate in the following sense:
it is guaranteed that any i such that ai ≥ φ‖a‖_1 is included, and each i with ai < (φ − ε)‖a‖_1 that previously appeared in the stream is included with probability at most δ. This is similar to what we would get if we just
ran a point query on all possible i, but (a) there are many possible i and (b)
we won’t ever output an i we’ve never seen.
The trick is to extend the data structure and update procedure to track
all the heavy elements found so far (stored in a heap, with the minimum
estimate at the top), as well as ‖a‖_1 = Σ_t ct. When a new increment (i, c)

comes in, we first update the count-min structure and then do a point query
on ai; if âi ≥ φ‖a‖_1, we insert i into the heap. We also delete any elements
at the top of the heap that have a point-query estimate below threshold.
Because âi ≥ ai , every heavy hitter is correctly identified. However, it’s
possible that an index stops being a heavy hitter at some point (because the
threshold φ‖a‖_1 rose since we included it). In this case it may get removed
from the heap, but if it becomes a heavy hitter again, we’ll put it back.
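A sketch of this bookkeeping on top of the CountMinSketch class sketched in §7.7.2.1; the lazy heap pruning is one reasonable way to implement the deletions described above, and the heap keys (estimates at insertion time) may be stale, which only affects pruning order.

import heapq

class HeavyHitters:
    def __init__(self, phi, eps, delta):
        self.cms = CountMinSketch(eps, delta)   # from the earlier sketch
        self.phi = phi
        self.total = 0            # running value of ||a||_1
        self.heap = []            # (estimate at insertion time, item)
        self.members = set()

    def update(self, i, c):
        self.cms.update(i, c)
        self.total += c
        threshold = self.phi * self.total
        if self.cms.point_query(i) >= threshold and i not in self.members:
            heapq.heappush(self.heap, (self.cms.point_query(i), i))
            self.members.add(i)
        # prune items at the top of the heap whose current estimate has
        # fallen below the (rising) threshold
        while self.heap and self.cms.point_query(self.heap[0][1]) < threshold:
            _, j = heapq.heappop(self.heap)
            self.members.discard(j)

    def heavy_hitters(self):
        return sorted(self.members)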

7.8 Locality-sensitive hashing


Locality-sensitive hashing was invented by Indyk and Motwani [IM98] to
solve the problem of designing a data structure that finds approximate nearest
neighbors to query points in high dimension. We’ll mostly be following this
paper in this section, concentrating on the hashing parts.

7.8.1 Approximate nearest neighbor search


In the nearest neighbor search problem (NNS for short), we are given a
set of n points P in a metric space with distance function d, and we want
to construct a data structure that allows us to quickly find the closest point
p in P to any given query point q. We could always compute the distance
between q and each possible p, but this takes time O(n), and we’d like to
get lookups to be sublinear in n.

Indyk and Motwani were particularly interested in what happens in Rd


for high dimension d under various natural metrics. Because the volume of
a ball in a high-dimensional space grows exponentially with the dimension,
this problem suffers from the curse of dimensionality [Bel57]: simple
techniques based on, for example, assigning points in P to nearby locations
in a grid may require searching exponentially many grid locations. Indyk
and Motwani deal with this through a combination of randomization and solving the weaker problem of ε-nearest neighbor search (ε-NNS), where it's acceptable to return a different point p′ as long as d(q, p′) ≤ (1 + ε) min_{p∈P} d(q, p).
This problem can be solved by reduction to a simpler problem called ε-point location in equal balls or ε-PLEB. In this problem, we are given n radius-r balls centered on points c in a set C, and we want a data structure that returns a point c′ ∈ C with d(q, c′) ≤ (1 + ε)r if there is at least one point c with d(q, c) ≤ r. If there is no such point, the data structure may or may not return a point (it might say no, or it might just return a point that is too far away, which we can discard). The difference between an ε-PLEB and NNS is that an ε-PLEB isn't picky about returning the closest point to q if there are multiple points that are all good enough. Still, we can reduce NNS to ε-PLEB.
The easy reduction is to use binary search. Let R = max_{x,y∈P} d(x,y) / min_{x,y∈P, x≠y} d(x,y). Given a point q, look for the minimum ℓ ∈ {(1 + ε)^0, (1 + ε)^1, . . . , R} for which an ε-PLEB data structure with radius ℓ and centers P returns a point p with d(q, p) ≤ (1 + ε)ℓ; then return this point as the approximate nearest neighbor.
This requires O(log_{1+ε} R) instances of the ε-PLEB data structure and O(log log_{1+ε} R) queries. The blowup as a function of R can be avoided using
a more sophisticated data structure called a ring-cover tree, defined in the
paper. We won’t talk about ring-cover trees because they are (a) complicated
and (b) not randomized. Instead, we’ll move directly to the question of how
we solve -PLEB.

7.8.2 Locality-sensitive hash functions


Definition 7.8.1 ([IM98]). A family of hash functions H is (r1 , r2 , p1 , p2 )-
sensitive for d if, for any points p and q, if h is chosen uniformly from
H,
1. If d(p, q) ≤ r1 , then Pr [h(p) = h(q)] ≥ p1 , and
2. If d(p, q) > r2 , then Pr [h(p) = h(q)] ≤ p2 .

These are useful if p1 > p2 and r1 < r2 ; that is, we are more likely to hash
inputs together if they are closer. Ideally, we can choose r1 and r2 to build
ε-PLEB data structures for a range of radii sufficient to do binary search as
described above (or build a ring-cover tree if we are doing it right). For the
moment, we will aim for an (r1 , r2 )-PLEB data structure, which returns a
point within r1 with high probability if one exists, and never returns a point
farther away than r2 .
There is some similarity between locality-sensitive hashing and a more gen-
eral dimension-reduction technique known as the Johnson-Lindenstrauss
lemma [JL84]; this says that projecting n points in a high-dimensional space to O(ε^{−2} log n) dimensions using an appropriate random matrix preserves ℓ2 distances between the points to within relative error ε (in fact, even a random matrix with entries in {−1, 0, +1} is enough [Ach03]). Unfortunately, dimension reduction by itself is not enough to solve approximate nearest neighbors in sublinear time, because we may still need to search a number of boxes exponential in O(ε^{−2} log n), which will be polynomial in n. But we'll
look at the Johnson-Lindenstrauss lemma and its many other applications
more closely in Chapter 8.

7.8.3 Constructing an (r1 , r2 )-PLEB


The first trick is to amplify the difference between p1 and p2 so that we can
find a point within r1 of our query point q if one exists. This is done in three
stages: First, we concatenate multiple hash functions to drive the probability
that distant points hash together down until we get few collisions: the idea
here is that we are taking the AND of the events that we get collisions in the
original hash function. Second, we hash our query point and target points
multiple times to bring the probability that nearby points hash together up:
this is an OR. Finally, we iterate the procedure to drive down any remaining
probability of failure below a target probability δ: another AND.
For the first stage, let k = log_{1/p2} n and define a composite hash function g(p) = (h1(p) . . . hk(p)). If d(p, q) > r2, then Pr[g(p) = g(q)] ≤ p2^k = p2^{log_{1/p2} n} = 1/n. Adding this up over all n points in our data structure gives us one false
match for q on average.
However, we may not be able to find the correct match for q, since p1
may not be all that much larger than p2 . For this, we do a second round of
amplification, where now we are taking the OR of events we want instead of
the AND of events we don’t want.
Let ℓ = n^ρ, where ρ = log(1/p1)/log(1/p2) = log p1/log p2 < 1, and choose hash functions g1 . . . gℓ independently as above. To store a point p, put it in a bucket for

gj (p) for each j; these buckets are themselves stored in a hash table (by
hashing the value of gj (p) down further) so that they fit in O(n) space.
Suppose now that d(p, q) ≤ r1 for some p. Then

Pr[gj(p) = gj(q)] ≥ p1^k
                 = p1^{log_{1/p2} n}
                 = n^{−log(1/p1)/log(1/p2)}
                 = n^{−ρ}
                 = 1/ℓ.

So by searching through ℓ independent buckets we find p with probability at least 1 − (1 − 1/ℓ)^ℓ = 1 − 1/e + o(1). We'd like to guarantee that we only have to look at O(n^ρ) points (most of which we may reject) during this process; but we can do this by stopping if we see more than 2ℓ points. Since we only expect to see ℓ bad points in all ℓ buckets, this event only happens with probability at most 1/2. So even adding it to the probability of failure from the hash functions not working right we still have only a constant probability of failure 1/e + 1/2 + o(1).
Iterating the entire process O(log(1/δ)) times then gives the desired
bound δ on the probability that this process fails to find a good point if one
exists.
Multiplying out all the costs gives a cost for a query of O(kℓ log(1/δ)) = O(n^ρ (log_{1/p2} n) log(1/δ)) hash function evaluations and O(n^ρ log(1/δ)) distance computations. The cost to insert a point is just O(kℓ log(1/δ)) = O(n^ρ (log_{1/p2} n) log(1/δ)) hash function evaluations, the same number as for a query.

7.8.4 Hash functions for Hamming distance


Suppose that our points are d-bit vectors and that we use Hamming dis-
tance for our metric. In this case, using the family of one-bit projections
{hi | hi(x) = xi} gives a locality-sensitive hash family [ABMRT96]. Specifically, we can show this family is (r, r(1 + ε), 1 − r/d, 1 − r(1 + ε)/d)-sensitive. The argument is trivial: if two points p and q are at distance r or less, they differ in at most r places, so the probability that they hash together is just the probability that we don't pick one of these places, which is at least 1 − r/d. Essentially the same argument works when p and q are far away.
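A minimal sketch combining these one-bit projections with the AND/OR amplification of §7.8.3. Points are 0/1 tuples, the parameters k and ell play the roles of k and ℓ above, and the max_candidates argument corresponds to the 2ℓ-candidate cutoff; the specific class and method names are illustrative.

import random

class HammingLSH:
    def __init__(self, d, k, ell, seed=None):
        rng = random.Random(seed)
        # each composite hash g_j concatenates k random bit positions
        self.g = [[rng.randrange(d) for _ in range(k)] for _ in range(ell)]
        self.buckets = [dict() for _ in range(ell)]   # signature -> list of points

    def _sig(self, j, x):
        return tuple(x[i] for i in self.g[j])

    def insert(self, x):
        for j, table in enumerate(self.buckets):
            table.setdefault(self._sig(j, x), []).append(x)

    def query(self, q, r2, max_candidates=None):
        """Return some stored point within Hamming distance r2 of q, if the
        search finds one among the candidates it inspects."""
        seen = 0
        for j, table in enumerate(self.buckets):
            for x in table.get(self._sig(j, q), []):
                if sum(a != b for a, b in zip(x, q)) <= r2:
                    return x
                seen += 1
                if max_candidates is not None and seen >= max_candidates:
                    return None
        return None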

These are not particularly clever hash functions, so the heavy lifting will
be done by the (r1, r2)-PLEB construction. Our goal is to build an ε-PLEB for any fixed r, which will correspond to an (r, r(1 + ε))-PLEB. The main thing we need to do, following [IM98] as always, is compute a reasonable bound on ρ = log p1/log p2 = ln(1 − r/d)/ln(1 − (1 + ε)r/d). This is essentially just a matter of
hitting it with enough inequalities, although there are a couple of tricks in
the middle.
Compute

ρ = ln(1 − r/d) / ln(1 − (1 + ε)r/d)
  = (d/r)·ln(1 − r/d) / ((d/r)·ln(1 − (1 + ε)r/d))
  = ln((1 − r/d)^{d/r}) / ln((1 − (1 + ε)r/d)^{d/r})
  ≤ ln(e^{−1}(1 − r/d)) / ln(e^{−(1+ε)})
  = (−1 + ln(1 − r/d)) / (−(1 + ε))
  = 1/(1 + ε) − ln(1 − r/d)/(1 + ε).        (7.8.1)
Note that we used the fact that 1 + x ≤ e^x for all x in the denominator and (1 − x)^{1/x} ≥ e^{−1}(1 − x) for x ∈ [0, 1] in the numerator. The first fact is our usual favorite inequality.
The second can be proved in a number of ways. The most visually intuitive is that (1 − x)^{1/x} and e^{−1}(1 − x) are equal at x = 1 and equal in the limit as x goes to 0, while (1 − x)^{1/x} is concave in between 0 and 1 and e^{−1}(1 − x) is linear. Unfortunately it is rather painful to show that (1 − x)^{1/x} is in fact concave. An alternative is to rewrite the inequality (1 − x)^{1/x} ≥ e^{−1}(1 − x) as (1 − x)^{1/x − 1} ≥ e^{−1}, apply a change of variables y = 1/x to get (1 − 1/y)^{y−1} ≥ e^{−1} for y ∈ [1, ∞), and then argue that (a) equality holds in the limit as y goes to infinity, and (b) the left-hand-side is

a nonincreasing function, since

(d/dy) ln((1 − 1/y)^{y−1}) = (d/dy) [(y − 1)(ln(y − 1) − ln y)]
                           = ln(1 − 1/y) + (y − 1)(1/(y − 1) − 1/y)
                           = ln(1 − 1/y) + 1 − (1 − 1/y)
                           = ln(1 − 1/y) + 1/y
                           ≤ −1/y + 1/y
                           = 0.

We now return to (7.8.1). We'd really like the second term to be small enough that we can just write n^ρ as n^{1/(1+ε)}. (Note that even though it looks negative, it isn't, because ln(1 − r/d) is negative.) So we pull a rabbit out of a hat by assuming that r/d < 1/ln n.⁸ This assumption can be justified by modifying the algorithm so that d is padded out with up to d ln n unused junk bits if necessary. Using this assumption, we get

n^ρ < n^{1/(1+ε)} · n^{−ln(1−1/ln n)/(1+ε)}
    ≤ n^{1/(1+ε)} (1 − 1/ln n)^{−ln n}
    ≤ e·n^{1/(1+ε)}.

Plugging into the formula for (r1, r2)-PLEB gives O(n^{1/(1+ε)} log n log(1/δ)) hash function evaluations per query, each of which costs O(1) time, plus O(n^{1/(1+ε)} log(1/δ)) distance computations, which will take O(d) time each. If we add in the cost of the binary search, we have to multiply this by O(log log_{1+ε} R · log log log_{1+ε} R), where the log-log-log comes from having to adjust δ so that the error doesn't accumulate too much over all O(log log R) steps. The end result is that we can do approximate nearest-neighbor queries in

O(n^{1/(1+ε)} log(1/δ)(log n + d) log log_{1+ε} R · log log log_{1+ε} R)

time. For ε reasonably large, this is much better than naively testing against all points in our database, which takes O(nd) time (although it does produce
an exact result).
⁸Indyk and Motwani pull this rabbit out of a hat a few steps earlier, but it's pretty
much the same rabbit either way.

7.8.5 Hash functions for ℓ1 distance


Essentially the same approach works for (bounded) ℓ1 distance, using discretization, where we replace a continuous variable over some range with a discrete variable. Suppose we are working in [0, 1]^d with the ℓ1 metric. Represent each coordinate xi as a sequence of d/ε values xij in unary, for j = 1 . . . d/ε, with xij = 1 if jε/d < xi. Then the Hamming distance between the bit-vectors representing x and y is proportional to the ℓ1 distance between the original vectors, plus an error term that is bounded by ε. We can then
use the hash functions for Hamming distance to get a locality-sensitive hash
family.
A nice bit about this construction is that we don’t actually have to build
the bit-vectors; instead, we can specify a coordinate xi and a threshold c
and get the same effect by recording whether xi > c or not.
Note that this does increase the cost slightly: we are converting d-dimensional vectors into (d/ε)-long bit vectors, so the log(n + d) term becomes log(n + d/ε). When n is small, this effectively multiplies the cost of a query by an extra log(1/ε). More significant is that we have to cut ε in half to obtain the same error bounds, because we now pay ε error for the data structure itself and an additional ε error for the discretization. So our revised cost for the ℓ1 case is

O(n^{1/(1+ε/2)} log(1/δ)(log n + d/ε) log log_{1+ε/2} R · log log log_{1+ε/2} R).
Chapter 8

Dimension reduction

In this chapter, we will discuss how randomization can be used to reduce


the dimension of a set of points in a way that approximately preserves the
distance between points. The main tool for doing this is a family of closely-
related results that collectively are known as the Johnson-Lindenstrauss
lemma.
We can’t really do full justice to the Johnson-Lindenstrauss lemma and
its applications here, although we will try to hit the high points. There is an
excellent survey by Freksen [Fre21] if you’d like to learn more.

8.1 The Johnson-Lindenstrauss lemma


The Johnson-Lindenstrauss lemma [JL84] says that it is possible to
project a set of n vectors in a space of arbitrarily high dimension onto an
O(log n)-dimensional subspace, such that the distances between the vectors
are approximately preserved. There are several versions of the lemma,
depending on how the projection is done, but in each case we want to
show that for sufficiently large k as a function of n there is some k × d
matrix A such that for any two d-dimensional vectors u and v in our set, (1 − ε)‖u − v‖² ≤ ‖Au − Av‖² ≤ (1 + ε)‖u − v‖². Typical choices for A are:

• A projection matrix for a uniform random k-dimensional subspace.


(Original Johnson-Lindenstrauss paper [JL84], also used in the Dasgupta-
Gupta [DG03] proof described below in §8.1.2.)

• A matrix whose elements are i.i.d. normal random variables. (Indyk-


Motwani [IM98].)


• A matrix whose elements are i.i.d. variables with a particular distribu-


tion on {−1, 0, +1}. (Achlioptas [Ach03].)

• A matrix whose elements are ±1 coin-flips [Ach03]. This is perhaps


the easiest to keep track of, but the constants in the error bounds are
a bit worse than the {−1, 0, +1} version.

In each case, we multiply the matrix by an appropriate fixed scale factor c so that E[‖cAu‖²] = ‖u‖². The resulting linear projection is known as a


Johnson-Lindenstrauss transformation (JLT for short).

8.1.1 Reduction to single-vector case


Most proofs of the theorem reduce to the case of a single vector, by finding a value of k such that Pr[(1 − ε)‖u‖² ≤ ‖Au‖² ≤ (1 + ε)‖u‖²] ≥ 1 − 2/n², and then using the union bound to show that this same property holds for all C(n, 2) vectors u − v with nonzero probability. This shows the existence of a good matrix, and we can generate matrices and test them until we find one that actually works.

8.1.2 A relatively simple proof of the lemma


This proof is due to Dasgupta and Gupta [DG03], and is somewhat simpler
than many proofs of the theorem. The basic idea is that if A projects onto
a uniform random k-dimensional subspace, then its effect on the length of
an arbitrary nonzero vector u is the same as its effect on the length of a
unit vector u/‖u‖, which is the same as the effect of some specific fixed k × d matrix B applied to a random unit vector Y drawn uniformly from the surface of the d-dimensional sphere S^d. An easy choice for B is just the matrix that extracts the first k coordinates from Y.
So how do we get a random unit vector Y? The usual trick is to use the fact that the multivariate normal distribution is radially symmetric: if we generate d independent normally-distributed random variables X1, . . . , Xd, then the vector Y = X/‖X‖ is uniformly distributed over S^d.¹
This gives Z = BY = (X1, . . . , Xk)/‖X‖, and ‖Z‖² = (X1² + X2² + · · · + Xk²)/(X1² + · · · + Xd²). If each value Xi² were exactly equal to its expectation

¹Radial symmetry is immediate from the density (1/√(2π)) e^{−x²/2} of the univariate normal distribution. If we consider a vector ⟨X1, . . . , Xd⟩ of independent N(0, 1) random variables, then the joint density is given by the product ∏_{i=1}^d (1/√(2π)) e^{−x_i²/2} = (2π)^{−d/2} e^{−Σ_i x_i²/2}. But Σ_i x_i² = r², where r is the distance from the origin, meaning this distribution has the same density at all points at the same distance.

1 (the variance of a standard N(0, 1) normal random variable), then ‖Z‖² would be exactly equal to k/d, and we could adjust the length of Z to equal 1 by multiplying Z by a correction factor of √(d/k). But we generally won't have each Xi² equal to 1; instead, we will use a Chernoff bound argument to show that ‖Z‖² lies between (1 − ε)(k/d) and (1 + ε)(k/d), for suitable ε and k, with high probability.
Following the approach in [DG03], we'll write β for (1 − ε) or (1 + ε) as appropriate. Starting with the lower bound, we want to show that for β = 1 − ε, it is unlikely that

‖Z‖² ≤ β(k/d),

which expands to

(Σ_{i=1}^k Xi²) / (Σ_{i=1}^d Xi²) ≤ β(k/d).

Having a ratio between two sums is a nuisance, but we can multiply out
the denominators to turn it into something we can apply a Chernoff-style
argument to.

Pr[(Σ_{i=1}^k Xi²)/(Σ_{i=1}^d Xi²) ≤ β(k/d)] = Pr[d·Σ_{i=1}^k Xi² ≤ βk·Σ_{i=1}^d Xi²]
    = Pr[βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²) ≥ 0]
    = Pr[exp(t(βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²))) ≥ 1]
    ≤ E[exp(t(βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²)))]
    = E[exp(tβkX²)]^{d−k} · E[exp(t(βk − d)X²)]^k,

where t can be any value greater than 0, the shift from probability to
expectation uses Markov’s inequality, and in the last step we replace each
independent occurrence of Xi with a standard normal random variable X.
Now we just need to be able to compute the moment generating function E[e^{sX²}] for X². The quick way to do this is to notice that X² has a chi-squared distribution with one degree of freedom (since the chi-squared distribution with k degrees of freedom is just the distribution of the sum of squares of k independent normal random variables), and look up its m.g.f. (1 − 2s)^{−1/2} (for s < 1/2).

We can substitute this m.g.f. in the formula above to get

Pr[(X1² + · · · + Xk²)/(X1² + · · · + Xd²) ≤ β(k/d)] ≤ (1/(1 − 2βkt))^{(d−k)/2} · (1/(1 − 2t(βk − d)))^{k/2}.
The rest is just the usual trick of finding the minimum value of this
expression over all values t (that don’t produce negative denominators in the
m.g.f.!) by differentiating and setting to 0. This turns out to be easiest if we
maximize 1/g(t)² where g(t) is the above expression; see the Dasgupta-Gupta paper for details. For β = 1 − ε < 1, they show that the optimal t is (1 − β)/(2β(d − kβ)), which gives, after a few intermediate steps,

Pr[‖Z‖² ≤ β(k/d)] ≤ e^{(k/2)(1−β+ln β)}.

The same argument applied to g(−t) gives essentially the same bound for β = 1 + ε > 1:

Pr[‖Z‖² ≥ β(k/d)] ≤ e^{(k/2)(1−β+ln β)}.

The 1 − β + ln β factors are very close to 0. For β = 1 − ε, we have 1 − β + ln β = ε + ln(1 − ε) = Θ(−ε²), and for β = 1 + ε, we similarly have 1 − β + ln β = −ε + ln(1 + ε) = Θ(−ε²). So in either case we need k = Θ(ε^{−2} ln n) to make the probability bound polynomially small in n.
A more precise calculation (see [DG03]) includes the next term in the Taylor series expansion of ln(1 ± ε) to get an exact inequality. This gives a good map at k ≥ 4(ε²/2 − ε³/3)^{−1} ln n. Note that both our crummy asymptotic bound and this more precise bound depend on n but not d.
Constructing a projection matrix for a random k-dimensional space is
mildly painful. The easiest way to do it may be to generate a random
k × d matrix of independent N (0, 1) variables and then apply Gram-Schmidt
orthogonalization, which takes O(k²d) time. A faster approach that gives
similar results is to use a matrix of independent random variables that are ±1
with probability 1/6 each and 0 the rest of the time. This doesn’t produce
exactly the same distribution on projections, but it can be shown (with much
more effort) to still work pretty well [Ach03].
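Here is a minimal sketch of the sparse ±1 construction; the scaling factor √(3/k) is chosen so that the projection preserves squared norms in expectation, and the use of Python lists rather than a numerical library is purely for self-containment.

import random
import math

def jl_transform(d, k, seed=None):
    """k x d matrix with entries +1 or -1 with probability 1/6 each and 0
    with probability 2/3, scaled so that E[||Ax||^2] = ||x||^2."""
    rng = random.Random(seed)
    scale = math.sqrt(3.0 / k)      # entry variance is 1/3, hence sqrt(3/k)
    return [[scale * rng.choice((-1, 0, 0, 0, 0, 1)) for _ in range(d)]
            for _ in range(k)]

def apply_jl(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Example: the squared distance between two random points is roughly preserved.
d, k = 1000, 200
A = jl_transform(d, k, seed=42)
rng = random.Random(7)
u = [rng.gauss(0, 1) for _ in range(d)]
v = [rng.gauss(0, 1) for _ in range(d)]
orig = sum((a - b) ** 2 for a, b in zip(u, v))
proj = sum((a - b) ** 2 for a, b in zip(apply_jl(A, u), apply_jl(A, v)))
print(orig, proj)   # typically agree to within roughly ten percent for k = 200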
If we leave out enough details, we can summarize all of these results as a
single lemma:
Lemma 8.1.1 ([JL84]). For every set of n points X in R^d, and any ε with 0 < ε < 1, there is a linear projection f : R^d → R^k with k = O(ε^{−2} log n) such that for any u and v in X,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².

8.1.3 Distributional version


It's worth noting that the argument for Lemma 8.1.1 doesn't use any
property of X except its size. This means that the same argument works
(with nonzero probability) without needing to know X. This gives the
“distributional” version of the lemma, which says

Lemma 8.1.2 ([JL84]). For every d, 0 < ε < 1, and 0 < δ < 1, there exists a distribution over linear functions f : R^d → R^k with k = O(ε^{−2} log(1/δ)) such that for every x ∈ R^d,

Pr[(1 − ε)‖x‖² ≤ ‖f(x)‖² ≤ (1 + ε)‖x‖²] ≥ 1 − δ.

This can be handy for applications where we don’t know the vectors we
will be working with in advance.

8.2 Applications
The intuition is that the Johnson-Lindenstrauss lemma lets us reduce the
dimension of some problem involving distances between n points, where we
are willing to tolerate a small constant relative error, from some arbitrary d
to O(log n) (or O(log(1/δ)) if we just care about the error probability per
pair of points). So applications tend to fall into one of two categories:

1. We want to run some algorithm on a set of points in a high-dimensional


space, but the cost of the algorithm is high (perhaps exponential!) as
a function of the dimension.

2. We want to run some algorithm on a set of points in a high-dimensional


space, but we don’t want to pay the linear-in-the-dimension space costs.

If we are lucky, we’ll get both payoffs, winning on both time and space.
For example, suppose we have a set of n points x1 , x2 , . . . , xn representing
the centers of various clusters in a d-dimensional space, and we want to
rapidly classify incoming points y into one of these clusters by finding which
xi y is closest to. If we do this naively, this is a Θ(nd) operation, since it takes
Θ(d) time to compute each distance between y and some xi . If instead we
are willing to accept the inaccuracy associated with Johnson-Lindenstrauss,
we can fix a matrix A in advance, replace each xi with Axi , and find the
xi that is (approximately) closest to y using O(d log n) time to reduce y to
Ay and O(n log n) time to compute the distance between Ay and each Axi .
In this case we are reducing both the time complexity of our classification

algorithm (at least if we don’t count the pre-processing time to generate the
Axi ) and the amount of data we need to store.
An example of space savings is the use of the Johnson-Lindenstrauss
transform in streaming algorithms (see §7.7). Freksen [Fre21] gives a simple
example of estimating ‖x‖_2 where x is a vector of counts of items from a set of size n presented one at a time. If we don't charge for the space to store the JLT function f, we can simply add the i-th column of the matrix to our running total whenever we see item i, and we need only store O(ε^{−2} log(1/δ)) distinct numerical values of an appropriate precision to estimate ‖x‖_2 to within ε relative error with probability at least 1 − δ. The problem is that in reality we do need to represent f somehow, and even for a ±1 matrix this will take Θ(nε^{−2} log(1/δ)) space. Fortunately it can be shown that generating f using a 4-independent hash function reduces the space for f to O(log n), giving the Tug-of-War sketch of Alon et al. [AMS96], one of the first compact streaming data structures. Though this is a nice application of the JLT, it's worth mentioning that Cormode and Muthukrishnan [CM05] observe that this is still significantly more costly for most queries than their own count-min sketch.
Chapter 9

Martingales and stopping


times

In §5.3.2, we used martingales to show that the outcome of some process


was tightly concentrated. Here we will show how martingales interact with
stopping times, which are random variables that control when we stop
carrying out some task. This will require a few new definitions.

9.1 Definitions
The general form of a martingale {Xt , Ft } consists of:

• A sequence of random variables X0 , X1 , X2 , . . . ; and

• A filtration F0 ⊆ F1 ⊆ F2 . . . , where each σ-algebra Ft represents


our knowledge at time t;

subject to the requirements that:

1. The sequence of random variables is adapted to the filtration, which


just means that each Xt is Ft-measurable or equivalently that Ft (and thus all subsequent Ft′ for t′ ≥ t) includes all knowledge of Xt; and

2. The martingale property

E [Xt+1 | Ft ] = Xt (9.1.1)

holds for all t.


We will also also need the following definition of a stopping time.


Given a filtration {Ft}, a random variable τ is a stopping time for {Ft} if τ ∈ N ∪ {∞} and the event [τ ≤ t] is Ft-measurable for all t ∈ N.¹ In simple terms, τ is a stopping time if you know at time t whether to stop there or not.

¹Different authors impose different conditions on the range of τ; for example, Mitzenmacher and Upfal [MU17] exclude the case τ = ∞. We allow τ = ∞ to represent the outcome where we never stop. This can be handy for modeling processes where this outcome is possible, although in practice we will typically insist that it occurs only with probability zero.
What we like about martingales is that iterating the martingale property
shows that E [Xt ] = E [X0 ] for all fixed t. We will show that, under reasonable
conditions, the same holds for Xτ when τ is a stopping time. (The random
variable Xτ is defined in the obvious way, as a random variable that takes
on the value of Xt when τ = t.)

9.2 Submartingales and supermartingales


In some cases we have a process where instead of getting equality in (9.1.1),
we get an inequality instead. A submartingale replaces (9.1.1) with

Xt ≤ E [Xt+1 | Ft ] (9.2.1)

while a supermartingale satisfies

Xt ≥ E [Xt+1 | Ft ] . (9.2.2)

In each case, what is “sub” or “super” is the value at the current time
compared to the expected value at the next time. Intuitively, a submartingale
corresponds to a process where you win on average, while a supermartingale
is a process where you lose on average. Casino games (in profitable casinos)
are submartingales for the house and supermartingales for the player.
Sub- and supermartingales can be reduced to martingales by subtracting
off the expected change at each step. For example, if {Xt } is a submartingale
with respect to {Ft }, then the process {Yt } defined recursively by

Y0 = X0
Yt+1 = Yt + Xt+1 − E [Xt+1 | Ft ]

is a martingale, since

E [Yt+1 | Ft ] = E [Yt + Xt+1 − E [Xt+1 | Ft ] | Ft ]


= Yt + E [Xt+1 | Ft ] − E [Xt+1 | Ft ]
= Yt .

One way to think of this is that Yt = Xt + ∆t , where ∆t is a predictable,


non-decreasing drift process that starts at 0. For supermartingales, the
same result holds, but now ∆t is non-increasing. This ability to decompose
an adapted stochastic process into the sum of a martingale and a predictable
drift process is known as the Doob decomposition theorem.

9.3 The optional stopping theorem


If (Xt , Ft ) is a martingale, then applying induction to the martingale property
shows that E [Xt ] = E [X0 ] for any fixed time t. The optional stopping
theorem shows that this also happens for Xτ when τ is a stopping time,
under various choices of additional conditions:

Theorem 9.3.1. Let (Xt , Ft ) be a martingale and τ a stopping time for


{Ft }. Then E [Xτ ] = E[X0 ] if at least one of the following conditions holds:

1. Bounded time. There is a fixed n such that τ ≤ n always.

2. Finite time and bounded range. Pr [τ < ∞] = 1, and there is a


fixed M such that for all t ≤ τ , |Xt | ≤ M .

3. Finite expected time and bounded increments. E [τ ] < ∞, and


there is a fixed c such that |Xt+1 − Xt | ≤ c for all t < τ .

4. General case. All three of the following conditions hold:

(a) Pr [τ < ∞] = 1,
(b) E [|Xτ |] < ∞, and
(c) lim_{t→∞} E[Xt · 1_{[τ>t]}] = 0.

It would be nice if we could show E [Xτ ] = E[X0 ] without the side


conditions, but in general this isn’t true. For example, the double-after-losing
martingale strategy in the St. Petersburg paradox (see §3.4.1.1) eventually
yields +1 with probability 1, so if τ is the time we stop playing, we have
Pr[τ < ∞] = 1, E[|Xτ|] < ∞, but E[Xτ] = 1 ≠ E[X0] = 0. To make this

happen, we have to violate all of bounded time (τ is not bounded), bounded


range (|Xt | roughly doubles every step until we stop), bounded increments
(|Xt+1 − Xt | doubles every step as well), and at least one of the three
conditions of the general case (the last one: lim_{t→∞} E[Xt · 1_{[τ>t]}] = −1 ≠ 0).
To prove Theorem 9.3.1, we'll use a truncation argument. The intuition is that for any fixed n, we can truncate Xτ to X_{min(τ,n)} and show that E[X_{min(τ,n)}] = E[X0]. This will immediately give the bounded time case. For the other cases, the argument is that lim_{n→∞} E[X_{min(τ,n)}] converges to E[Xτ] provided the missing part E[Xτ − X_{min(τ,n)}] converges to zero. How we do this depends on which assumptions we are making.
We’ll start by formalizing the core truncation argument:

Lemma 9.3.2. Let (Xt, Ft) be a martingale and τ a stopping time for {Ft}. Then for any n ∈ N, E[X_{min(τ,n)}] = E[X0].

Proof. Define Yt = X0 + Σ_{i=1}^t (Xi − X_{i−1})·1_{[τ > i−1]}. Then (Yt, Ft) is a martingale, because we can calculate E[Yt+1 | Ft] = E[Yt + (Xt+1 − Xt)·1_{[τ>t]} | Ft] = Yt + 1_{[τ>t]}·E[Xt+1 − Xt | Ft] = Yt; effectively, we are treating 1_{[τ ≤ t−1]} as a sequence of bets, and we know that adjusting our bets doesn't change the martingale property. But then E[X_{min(τ,n)}] = E[Yn] = E[Y0] = E[X0].

As claimed, this gives us the bounded-time variant for free. If τ ≤ n always, then Xτ = X_{min(τ,n)}, and E[Xτ] = E[X_{min(τ,n)}] = E[X0].
For each of the unbounded-time variants, we will apply some version of the following strategy:

1. Observe that since E[X_{min(τ,n)}] = E[X0] is a constant for any fixed n, lim_{n→∞} E[X_{min(τ,n)}] converges to E[X0].

2. Argue using whatever assumptions we are making that lim_{n→∞} E[X_{min(τ,n)}] also converges to E[Xτ].

3. Conclude that E[X0] = E[Xτ], since they are both limits of the same sequence.

For the middle step, start with

Xτ = X_{min(τ,n)} + 1_{[τ>n]}(Xτ − Xn).



This holds because either τ ≤ n, and we just get Xτ , or τ > n, and we get
Xn + (Xτ − Xn ) = Xτ .
Taking the expectation of both sides gives

E[Xτ] = E[X_{min(τ,n)}] + E[1_{[τ>n]}(Xτ − Xn)]
      = E[X0] + E[1_{[τ>n]}(Xτ − Xn)].

So if we can show that the right-hand term goes to zero in the limit, we are done.
For the bounded-range case, we have |Xτ − Xn| ≤ 2M, so E[1_{[τ>n]}(Xτ − Xn)] ≤ 2M·Pr[τ > n]. Since in this case we assume Pr[τ < ∞] = 1, lim_{n→∞} Pr[τ > n] = 0, and the theorem holds.
For bounded increments, we have

E[(Xτ − Xn)·1_{[τ>n]}] = E[Σ_{t≥n} (Xt+1 − Xt)·1_{[τ>t]}]
                       ≤ E[Σ_{t≥n} |Xt+1 − Xt|·1_{[τ>t]}]
                       ≤ E[Σ_{t≥n} c·1_{[τ>t]}]
                       ≤ c·E[Σ_{t≥n} 1_{[τ>t]}].

But E[τ] = Σ_{t=0}^∞ Pr[τ > t]. Under the assumption that this series converges, its tail goes to zero, and again the theorem holds.
For the general case, we can expand

E[Xτ] = E[X_{min(τ,n)}] + E[1_{[τ>n]}Xτ] − E[1_{[τ>n]}Xn],

which implies

lim_{n→∞} E[Xτ] = lim_{n→∞} E[X_{min(τ,n)}] + lim_{n→∞} E[1_{[τ>n]}Xτ] − lim_{n→∞} E[1_{[τ>n]}Xn],

assuming all these limits exist and are finite. We’ve already established that
the first limit is E [X0 ], which is exactly what we want. So we just need to
show that the other two limits both converge to zero. For the last limit, we just use condition (4c), which gives lim_{n→∞} E[1_{[τ>n]}Xn] = 0; no further

argument is needed. But we still need to show that the middle limit also
vanishes. h i P h i
Here we use condition (4b). Observe that E 1[τ >n] Xτ = ∞
t=n+1 E 1 [τ =t] Xt .
h i
Compare this with E [Xτ ] = ∞
P
t=0 E 1[τ =t] Xt ; this is an absolutely conver-
gent series (this is why we need condition (4b)), so in the limit the sum of
the terms for i = 0 . . . n converges to E [Xτ ]. But this means that the sum
of the remaining terms for i = n + 1 . . . ∞ converges to zero. So the middle
term goes to zero as n goes to infinity. This completes the proof.

9.4 Applications
Here we give some examples of the Optional Stopping Theorem in action. In
each case, the trick is to find an appropriate martingale and stopping time,
and let the theorem do all the work.

9.4.1 Random walks


Let Xt be an unbiased ±1 random walk that starts at 0, adds ±1 to its
current position with equal probability at each step, and stops if it reaches
−a or +b.2 We’d like to calculate the probability of reaching +b before −a.
Let τ be the time at which the process stops.
We can easily show that Pr[τ < ∞] = 1 and E[τ] < ∞ by observing that from any state of the random walk, there is a probability of at least 2^{−(a+b)} that it stops within a + b steps (by flipping heads a + b times in a row), so that if we consider a sequence of intervals of length a + b, the expected number of such intervals we can have before we stop is at most 2^{a+b}, giving E[τ] ≤ (a + b)2^{a+b} (we can do better than this).
We also have bounded increments by the definition of the process
(bounded range also works, at least up until time τ ). So E [Xτ ] = E[X0 ] = 0
and the probability p of landing on +b instead of −a must satisfy pb − (1 − p)a = 0, giving p = a/(a + b).
Now suppose we want to find E[τ]. Let Yt = Xt² − t. Then Yt+1 = (Xt ± 1)² − (t + 1) = Xt² ± 2Xt + 1 − (t + 1) = (Xt² − t) ± 2Xt = Yt ± 2Xt. Since the plus and minus cases are equally likely, they cancel out in expectation and E[Yt+1 | Ft] = Yt: we just showed Yt is a martingale.³ We can also show
it has bounded increments (at least up until time τ), because |Yt+1 − Yt| = 2|Xt| ≤ max(a, b). From Theorem 9.3.1, E[Yτ] = 0, which gives E[τ] = E[Xτ²]. But we can calculate E[Xτ²]: it is a²·Pr[Xτ = −a] + b²·Pr[Xτ = b] = a²(b/(a + b)) + b²(a/(a + b)) = (a²b + b²a)/(a + b) = ab.
If we have a random walk that only stops at +b,⁴ then if τ is the first time at which Xτ = b, τ is a stopping time. However, in this case E[Xτ] = b ≠ E[X0] = 0. So the optional stopping theorem doesn't apply in this case. But we have bounded increments, so Theorem 9.3.1 would apply if E[τ] < ∞. It follows that the expected time until we reach b is unbounded, either because sometimes we never reach b, or because we always reach b but sometimes it takes a very long time.⁵

²This is called a random walk with two absorbing barriers.
³This construction generalizes in a nice way to arbitrary martingales. Suppose {Xt} is a martingale with respect to {Ft}. Let ∆t = Xt − Xt−1, and let Vt = Var[∆t | Ft−1] be the conditional variance of the t-th increment (note that this is a random variable that may depend on previous outcomes). We can easily show that Yt = Xt² − Σ_{i=1}^t Vi is a martingale. The proof is that

E[Yt | Ft−1] = E[Xt² − Σ_{i=1}^t Vi | Ft−1]
             = E[(Xt−1 + ∆t)² | Ft−1] − Σ_{i=1}^t Vi
             = E[Xt−1² + 2Xt−1∆t + ∆t² | Ft−1] − Σ_{i=1}^t Vi
             = Xt−1² + 2Xt−1·E[∆t | Ft−1] + E[∆t² | Ft−1] − Σ_{i=1}^t Vi
             = Xt−1² + 0 + Vt − Σ_{i=1}^t Vi
             = Xt−1² − Σ_{i=1}^{t−1} Vi
             = Yt−1.

For the ±1 random walk case, we have Vt = 1 always, giving Σ_{i=1}^t Vi = t and E[Xτ²] = E[X0²] + E[τ] when τ is a stopping time satisfying the conditions of the Optional Stopping Theorem. For the general case, the same argument gives E[Xτ²] = E[X0²] + E[Σ_{t=1}^τ Vt] instead: the expected square position of Xt is incremented by the conditional variance at each step.
⁴This would be a random walk with one absorbing barrier.
⁵In fact, we always reach b. An easy way to see this is to imagine a sequence of intervals of length n1, n2, . . . , where n_{i+1} = (b + Σ_{j=1}^i nj)². At the end of the i-th interval, we are no lower than −Σ_{j=1}^i nj, so we only need to go up √n_{i+1} positions to reach b by the end of the (i + 1)-th interval. Since this is just one standard deviation, it occurs with constant probability, so after a finite expected number of intervals, we will reach b. Since there are infinitely many intervals, we reach b with probability 1.
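Before moving on to biased walks, here is a quick simulation that checks both conclusions for the two-barrier walk empirically; the parameters a = 3 and b = 7 are arbitrary choices.

import random

def walk_two_barriers(a, b, rng):
    """Run an unbiased ±1 walk from 0 until it hits -a or +b.
    Returns (hit_b, number_of_steps)."""
    x, t = 0, 0
    while -a < x < b:
        x += rng.choice((-1, 1))
        t += 1
    return x == b, t

# Empirical check: Pr[hit +b before -a] = a/(a+b) and E[tau] = a*b.
rng = random.Random(1)
a, b, trials = 3, 7, 100_000
wins = steps = 0
for _ in range(trials):
    hit_b, t = walk_two_barriers(a, b, rng)
    wins += hit_b
    steps += t
print(wins / trials, a / (a + b))   # both close to 0.3
print(steps / trials, a * b)        # both close to 21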

We can also consider a biased random walk where +1 occurs with


probability p and −1 with probability q = 1 − p. If Xt is the position of the
random walk at time t, and Ft is the associated σ-algebra, then Xt isn’t a
martingale with respect to Ft . But there are at least two ways to turn it
into one:
1. Define Yt = Xt − (p − q)t. Then

E[Yt+1 | Ft] = E[Xt+1 − (p − q)(t + 1) | Ft]
             = p(Xt + 1) + q(Xt − 1) − (p − q)(t + 1)
             = (p + q)Xt + (p − q) − (p − q)(t + 1)
             = Xt − (p − q)t
             = Yt,

and Yt is a martingale with respect to Ft .

2. Define Zt = (q/p)^{Xt}. Then

E[Zt+1 | Ft] = p(q/p)^{Xt + 1} + q(q/p)^{Xt − 1}
             = (q/p)^{Xt}(p(q/p) + q(p/q))
             = (q/p)^{Xt}(q + p)
             = (q/p)^{Xt}
             = Zt.

Again we have a martingale with respect to Ft .


Now let’s see what we can do with the Optional Stopping Theorem.

1. Suppose we start with X0 = 0 and we want to know the probability pb


that we will reach +b before we reach −a.
Let τ be the first time at which Xτ ∈ {−a, b}. We can use the same
argument as in the unbiased case to show that Pr [τ < ∞] = 1 and
E[τ] < ∞, because from any position Xt there is at least a p^{a+b} > 0 chance that the next a + b steps will all be +1 and we will reach b. Since we can flip this p^{a+b}-probability coin independently every a + b steps, eventually we reach b if we haven't already reached −a. The expected time is bounded by (a + b)/p^{a+b}.
 
We also have that 0 < Zt ≤ max((q/p)^a, (q/p)^b) for all t ≤ τ. This gives us bounded range, so the finite-time/bounded-range case of OST applies.
We thus have E[Zτ] = pb(q/p)^b + (1 − pb)(q/p)^{−a} = E[Z0] = (q/p)^0 = 1. Solving for pb gives

pb = (1 − (q/p)^{−a}) / ((q/p)^b − (q/p)^{−a}).

(Note that this only makes sense if q ≠ p.)
As a test, if −a = −1 and +b = +1, then we get

pb = (1 − p/q) / (q/p − p/q)
   = (pq − p²) / (q² − p²)
   = (pq − p²) / (q − p)        (since p + q = 1)
   = p(q − p) / (q − p)
   = p,

which is what we would expect since we hit −a or +b on the first step.
A more interesting case is if we set p < 1/2. Then q/p > 1, and

pb = (1 − (q/p)^{−a}) / ((q/p)^b − (q/p)^{−a})
   = (q/p)^{−b} · (1 − (q/p)^{−a}) / (1 − (q/p)^{−a−b})
   < (q/p)^{−b}.

Any walk that is biased against us will be an exponentially improbable


hill to climb.
2. Now suppose we want to know E[τ], the average time at which we first hit −a or +b. We already argued E[τ] is finite, and it's easy to see that {Yt} has bounded increments, so we can use the finite-expected-time/bounded-increment case of OST to get E[Yτ] = E[Y0] = 0, or E[Xτ − (p − q)τ] = 0. It follows that E[τ] = E[Xτ]/(p − q).
But we can compute E[Xτ], since it is just pb·b − (1 − pb)a. So E[τ] = (pb·b − (1 − pb)a)/(p − q). If p > q, and a and b are both large enough to make pb very close to 1, this will be approximately b/(p − q), the time to climb to b using our average return of p − q per step.

9.4.2 Wald’s equation


Suppose we run a Las Vegas algorithm until it succeeds, and the i-th attempt
costs Xi , where all the Xi are independent, satisfy 0 ≤ Xi ≤ c for some c,
and have a common mean E [Xi ] = µ.
Let N be the number of times we run the algorithm. Since we can tell when we are done, N is a stopping time with respect to some filtration {Fi} to which the Xi are adapted.⁶ Suppose also that E[N] exists. What is E[Σ_{i=1}^N Xi]?
If N were not a stopping time, this might be a very messy problem
indeed. But when N is a stopping time, we can apply it to the martingale
Yt = Σ_{i=1}^t (Xi − µ). This has bounded increments (0 ≤ Xi ≤ c, so −c ≤ Xi − E[Xi] ≤ c), and we've already said E[N] is finite (which implies Pr[N < ∞] = 1), so Theorem 9.3.1 applies. We thus have
0 = E[YN]
  = E[Σ_{i=1}^N (Xi − µ)]
  = E[Σ_{i=1}^N Xi] − E[Σ_{i=1}^N µ]
  = E[Σ_{i=1}^N Xi] − E[N]·µ.

Rearranging this gives Wald's equation:

E[Σ_{i=1}^N Xi] = E[N]·µ.        (9.4.1)

This is the same formula as in §3.4.3.1, but we've eliminated the bound on N and allowed for much more dependence between N and the Xi.⁷

⁶A stochastic process {Xt} is adapted to a filtration {Ft} if each Xt is Ft-measurable.
⁷In fact, looking closely at the proof reveals that we don't even need the Xi to be independent of each other. We just need that E[Xi+1 | Fi] = µ for all i to make (Yt, Ft) a martingale. But if we don't carry any information from one iteration of our Las Vegas algorithm to the next, we'll get independence anyway. So the big payoff is not having to worry about whether N has some devious dependence on the Xi.
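A quick empirical check of (9.4.1) on a toy Las Vegas algorithm; the cost distribution and success probability below are arbitrary choices.

import random

def las_vegas_run(rng):
    """One attempt: costs a uniform number of steps in [1, 10] and
    succeeds with probability 1/4."""
    cost = rng.randint(1, 10)
    success = rng.random() < 0.25
    return cost, success

# Wald's equation: E[total cost] = E[N] * mu, with mu = 5.5 and E[N] = 4,
# so the expected total cost is 22.
rng = random.Random(0)
trials = 100_000
total_cost = total_runs = 0
for _ in range(trials):
    while True:
        cost, success = las_vegas_run(rng)
        total_cost += cost
        total_runs += 1
        if success:
            break
print(total_cost / trials, (total_runs / trials) * 5.5)   # both close to 22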

9.4.3 Maximal inequalities


Suppose we have a martingale {X_i} with X_i ≥ 0 always, and we want to bound max_{i≤n} X_i. We can do this using the Optional Stopping Theorem:
^6 A stochastic process {X_t} is adapted to a filtration {F_t} if each X_t is F_t-measurable.
^7 In fact, looking closely at the proof reveals that we don't even need the X_i to be independent of each other. We just need that E[X_{i+1} | F_i] = µ for all i to make (Y_t, F_t) a martingale. But if we don't carry any information from one iteration of our Las Vegas algorithm to the next, we'll get independence anyway. So the big payoff is not having to worry about whether N has some devious dependence on the X_i.

Lemma 9.4.1. Let {X_i} be a martingale with X_i ≥ 0. Then for any fixed n,

Pr[max_{i≤n} X_i ≥ α] ≤ E[X_0]/α. (9.4.2)
Proof. The idea is to pick a stopping time τ such that max_{i≤n} X_i ≥ α if and only if X_τ ≥ α.
Let τ be the first time t such that X_t ≥ α or t ≥ n. Then τ is a stopping time for {X_i}, since we can determine from X_0, . . . , X_t whether τ ≤ t or not. We also have that τ ≤ n always, which is equivalent to τ = min(τ, n). Finally, X_τ ≥ α means that max_{i≤n} X_i ≥ X_τ ≥ α, and conversely if there is some t ≤ n with X_t = max_{i≤n} X_i ≥ α, then τ is the first such t, giving X_τ ≥ α.
Lemma 9.3.2 says E[X_τ] = E[X_0]. So Markov's inequality gives Pr[max_{i≤n} X_i ≥ α] = Pr[X_τ ≥ α] ≤ E[X_τ]/α = E[X_0]/α, as claimed.

Lemma 9.4.1 is a special case of Doob’s martingale inequality, which


says that for a non-negative submartingale {Xi },

Pr[max_{i≤n} X_i ≥ α] ≤ E[X_n]/α. (9.4.3)
The proof is similar, but requires showing first that E [Xτ ] ≤ E [Xn ] when
τ ≤ n is a stopping time and {Xi } is a submartingale.
Doob’s martingale inequality is what you get if you generalize Markov’s
inequality to martingales. The analogous generalization of Chebyshev’s
inequality is Kolmogorov’s inequality, which says:
Lemma 9.4.2. For sums S_i = \sum_{j=1}^{i} X_j of independent random variables X_1, X_2, . . . , X_n with E[X_i] = 0,

Pr[max_{i≤n} |S_i| ≥ α] ≤ Var[S_n]/α^2. (9.4.4)

Proof. Let Y_i = S_i^2 − Var[S_i]. Then {Y_i} is a martingale. This implies that E[Y_n] = E[S_n^2 − Var[S_n]] = Y_0 = 0 and thus that E[S_n^2] = Var[S_n]. It's easy to see that S_i^2 is a submartingale, since S_i^2 = Y_i + Var[S_i] and the variances of the partial sums can only increase over time. Now apply (9.4.3) to {S_i^2}, using the fact that max_{i≤n} |S_i| ≥ α if and only if max_{i≤n} S_i^2 ≥ α^2.

In general, because we can always stop updating a martingale once we hit


a particular threshold, other martingale concentration bounds like Azuma-
Hoeffding will also apply to the maximum or minimum of a martingale over
a given interval.
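A quick empirical illustration (not part of the notes): the following Python sketch estimates Pr[max_{i≤n} |S_i| ≥ α] for fair ±1 steps and compares it with the Kolmogorov bound Var[S_n]/α² = n/α²; the parameter values are arbitrary.

```python
import random

def kolmogorov_check(n=100, alpha=25, trials=50_000):
    """Estimate Pr[max_{i<=n} |S_i| >= alpha] for independent fair +/-1
    steps and compare it with the bound Var[S_n]/alpha^2 = n/alpha^2."""
    exceed = 0
    for _ in range(trials):
        s, max_abs = 0, 0
        for _ in range(n):
            s += random.choice((-1, 1))
            max_abs = max(max_abs, abs(s))
        exceed += (max_abs >= alpha)
    print("empirical Pr[max |S_i| >= alpha]:", exceed / trials)
    print("Kolmogorov bound n/alpha^2:      ", n / alpha ** 2)

if __name__ == "__main__":
    kolmogorov_check()
```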

9.4.4 Waiting times for patterns


Let’s suppose we flip coins until we see some pattern appear: for example,
we might flip coins until we see HTHH. What is the expected number of
coin-flips until this happens?
A very clever trick due to Li [Li80] solves this problem exactly using the Optional Stopping Theorem. Suppose our pattern is x_1 x_2 . . . x_k. We imagine an army of gamblers, one of whom shows up before each coin-flip. Each gambler starts by betting $1 that the next coin-flip will be x_1. If they win, they bet $2 that the next coin-flip will be x_2, continuing to play double-or-nothing until they either lose (and are down $1) or win their last bet on x_k (and are up 2^k − 1). Because each gambler's winnings form a martingale, so does their sum, and so the expected total return of all gamblers up to the stopping time τ at which our pattern first occurs is 0.
We can now use this fact to compute E[τ]. When we stop at time τ, we have one gambler who has won 2^k − 1. We may also have other gamblers who are still in play. For each i with x_1 . . . x_i = x_{k−i+1} . . . x_k, there will be a gambler with net winnings \sum_{j=1}^{i} 2^{j−1} = 2^i − 1. The remaining gamblers will all be at −1.
Let χ_i = 1 if x_1 . . . x_i = x_{k−i+1} . . . x_k, and 0 otherwise. Then the number of losers is given by τ − \sum_{i=1}^{k} χ_i and the total expected payoff is

E[X_τ] = E[−(τ − \sum_{i=1}^{k} χ_i) + \sum_{i=1}^{k} χ_i (2^i − 1)]
       = E[−τ + \sum_{i=1}^{k} χ_i 2^i]
       = 0.

It follows that E[τ] = \sum_{i=1}^{k} χ_i 2^i.
As a quick test, the pattern H has E[τ] = 2^1 = 2. This is consistent with what we know about geometric distributions.
For a longer example, the pattern HTHH only overlaps with its prefix H, so in this case we have E[τ] = \sum_i χ_i 2^i = 16 + 2 = 18. But HHHH overlaps with all of its prefixes, giving E[τ] = 16 + 8 + 4 + 2 = 30. At the other extreme, THHH has no overlap at all and gives E[τ] = 16.
In general, for a pattern of length k, we expect a waiting time somewhere between 2^k and 2^{k+1} − 2, almost a factor of 2 difference depending on how much overlap we get.
This analysis generalizes in the obvious way to biased coins and larger
alphabets. See the paper [Li80] for details.
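The formula E[τ] = \sum_i χ_i 2^i is easy to evaluate mechanically. Here is a small Python sketch (not from the notes; the function names are made up) that computes it and compares the result against a direct simulation with fair coin-flips.

```python
import random

def expected_wait(pattern):
    """E[tau] = sum over i of chi_i * 2^i, where chi_i = 1 exactly when
    the length-i prefix of the pattern equals its length-i suffix."""
    k = len(pattern)
    return sum(2 ** i for i in range(1, k + 1) if pattern[:i] == pattern[k - i:])

def simulated_wait(pattern, trials=20_000):
    """Average number of fair coin-flips until the pattern first appears."""
    total = 0
    for _ in range(trials):
        window, flips = "", 0
        while not window.endswith(pattern):
            window += random.choice("HT")
            flips += 1
        total += flips
    return total / trials

if __name__ == "__main__":
    for pat in ("HTHH", "HHHH", "THHH"):
        print(pat, expected_wait(pat), round(simulated_wait(pat), 2))
```

The exact values 18, 30, and 16 for these three patterns match the calculations above.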
Chapter 10

Markov chains

A (discrete time) Markov chain is a sequence of random variables X0 , X1 , X2 , . . . ,


which we think of as the position of some particle at increasing times in N,
where the distribution of Xt+1 depends only on the value of Xt . A typical
example of a Markov chain is a random walk on a graph: each Xt is a node
in the graph, and a step moves to one of the neighbors of the current node
chosen at random, each with equal probability.
Markov chains come up in randomized algorithms both because the
execution of any randomized algorithm is, in effect, a Markov chain (the
random variables are the states of the algorithm); and because we can
often sample from distributions that are difficult to sample from directly
by designing a Markov chain that converges to the distribution we want.
Algorithms that use this latter technique are known as Markov chain
Monte Carlo algorithms, and rely on the fundamental fact that a Markov
chain that satisfies a few straightforward conditions will always converge in
the limit to a stationary distribution, no matter what state it starts in.
An example of this technique that predates randomized algorithms is
card shuffling: each permutation of the deck is a state, and the shuffling
operation sends the deck to a new state each time it is applied. Assuming
the shuffling operation is not too deterministic, it is possible to show that
enough shuffling will eventually produce a state that is close to being a
uniform random permutation. The big algorithmic question for this and
similar Markov chains is how quickly this happens: what is the mixing time
of the Markov chain, a measure of how long we have to run it to get close to
its limit distribution (this notion is defined formally in §10.2.3). Many of
the techniques in this chapter will be aimed at finding bounds on the mixing
time for particular Markov processes.


If you want to learn more about Markov chains than presented here, they
are usually covered in general probability textbooks (for example, in [Fel68]
or [GS01]), mentioned in many linear algebra textbooks [Str03], covered
in some detail in stochastic processes textbooks [KT75], and covered in
exquisite detail in many books dedicated specifically to the subject [KS76,
KSK76]. Good sources for mixing times for Markov chains are the textbook
of Levin, Peres, and Wilmer [LPW09] and the survey paper by Montenegro
and Tetali [MT05]. An early reference on the mixing times for random
walks on graphs that helped inspire much subsequent work is the Aldous-Fill manuscript [AF01], which can be found on-line at http://www.stat.berkeley.edu/~aldous/RWG/book.html.

10.1 Basic definitions and properties


A Markov chain or Markov process is a stochastic process1 where the
distribution of Xt+1 depends only on the value of Xt and not any previous
history. Formally, this means that

Pr [Xt+1 = j | Xt = it , Xt−1 = it−1 , . . . , X0 = i0 ] = Pr [Xt+1 = j | Xt = it ] .


(10.1.1)
A stochastic process with this property is called memoryless: at any time,
you know where you are, and you can figure out where you are going, but
you don’t know where you were before.
The state space of the chain is just the set of all values that each Xt can
have. A Markov chain is finite or countable if it has a finite or countable
state space, respectively. We’ll mostly be interested in finite Markov chains
(since we have to be able to fit them inside our computer), but countable
Markov chains will come up in some contexts.2
We’ll also assume that our Markov chains are homogeneous, which
means that Pr [Xt+1 = j | Xt = i] doesn’t depend on t.
^1 A stochastic process is just a sequence of random variables {S_t}, where we usually think of t as representing time and the sequence as representing the evolution of some system over time. Here we are considering discrete-time processes, where t will typically be a non-negative integer.
^2 If the state space is not countable, we run into the same measure-theoretic issues as with continuous random variables, and have to replace (10.1.1) with the more general condition that

E[1_{[X_{t+1} ∈ A]} | X_t, X_{t−1}, . . . , X_0] = E[1_{[X_{t+1} ∈ A]} | X_t],

provided A is measurable with respect to some appropriate σ-algebra. We don't really want to deal with this, and for the most part we don't have to, so we won't.

[Figure: a four-state chain drawn as a path 1-2-3-4 with self-loops at states 1 and 4, edges labeled p11, p12, p21, p23, p32, p34, p43, p44, shown next to its transition matrix

      [ p11  p12  0    0   ]
      [ p21  0    p23  0   ]
      [ 0    p32  0    p34 ]
      [ 0    0    p43  p44 ]  ]

Figure 10.1: Drawing a Markov chain as a directed graph. Nodes represent states. Edges, labeled with probabilities, represent possible transitions. Zero-probability transitions are usually omitted. The corresponding transition matrix is shown on the right.

For a homogeneous countable Markov chain, we can describe its behavior


completely by giving the state space and the one-step transition proba-
bilities pij = Pr [Xt+1 = j | Xt = i]. Given pij , we can calculate two-step
transition probabilities
p^{(2)}_{ij} = Pr[X_{t+2} = j | X_t = i]
            = \sum_k Pr[X_{t+2} = j | X_{t+1} = k] Pr[X_{t+1} = k | X_t = i]
            = \sum_k p_{ik} p_{kj}.

This is identical to the formula for matrix multiplication. For a Markov


chain with n states, we can specify the transition probabilities pij using an
n × n transition matrix P with Pij = pij , and the two-step transition
probabilities are given by p^{(2)}_{ij} = (P^2)_{ij}. More generally, the t-step transition probabilities are given by p^{(t)}_{ij} = (P^t)_{ij}.
Conversely, given any matrix with non-negative entries where the rows sum to 1 (\sum_j P_{ij} = 1, or P1 = 1, where 1 in the second equation stands for the all-ones vector), there is a corresponding Markov chain given by p_{ij} = P_{ij}. Such a matrix is called a stochastic matrix; and for every stochastic matrix there is a corresponding finite Markov chain and vice versa.
The general formula for (s+t)-step transition probabilities is that p^{(s+t)}_{ij} = \sum_k p^{(s)}_{ik} p^{(t)}_{kj}. This is known as the Chapman-Kolmogorov equation and is equivalent to the matrix identity P^{s+t} = P^s P^t. It is also similar to
the formula for counting paths in a directed graph, and we can use the
correspondence between matrices and (labeled) directed graphs to give visual
depictions of particular Markov chains, as shown in Figure 10.1.
A distribution over states of a finite Markov chain at some time t can be given by a row vector x, where x_i = Pr[X_t = i]. To compute the

distribution at time t + 1, we use the law of total probability: Pr[X_{t+1} = j] = \sum_i Pr[X_t = i] Pr[X_{t+1} = j | X_t = i] = \sum_i x_i p_{ij}. Again we have the formula for matrix multiplication (where we treat x as a 1 × n matrix); so the distribution vector at time t + 1 is just xP, and at time t + n is xP^n.
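To make the matrix formulation concrete, here is a short numpy sketch (not from the notes); the numerical transition probabilities below are arbitrary values filled into the shape of the chain in Figure 10.1.

```python
import numpy as np

# A concrete instance of the chain in Figure 10.1; the probabilities are
# arbitrary, but each row sums to 1, so P is a stochastic matrix.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.0, 0.7, 0.0],
    [0.0, 0.4, 0.0, 0.6],
    [0.0, 0.0, 0.2, 0.8],
])

x = np.array([1.0, 0.0, 0.0, 0.0])   # start in state 1 with probability 1

# The t-step transition probabilities are (P^t)_{ij}; the distribution
# at time t is the row vector x P^t.
for t in (1, 2, 10, 100):
    print(t, x @ np.linalg.matrix_power(P, t))
```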
We like Markov chains for two reasons:

1. They describe what happens in a randomized algorithm; the state


space is just the set of all states of the algorithm, and the Markov
property holds because the algorithm can’t remember anything that
isn’t part of its state. So if we want to analyze randomized algorithms,
we will need to get good at analyzing Markov chains.

2. They can be used to do sampling over interesting distributions. Under


appropriate conditions (see below), the state of a Markov chain con-
verges to a stationary distribution. If we build the right Markov
chain, we can control what this stationary distribution looks like, run
the chain for a while, and get a sample close to the stationary distribu-
tion.

In both cases we want to have a bound on how long it takes the Markov
chain to converge, either because it tells us when our algorithm terminates,
or because it tells us how long to mix it up before looking at the current
state.

10.1.1 Examples
• A fair ±1 random walk. The state space is Z, the transition probabilities
are pij = 1/2 if |i − j| = 1, 0 otherwise. This is an example of a Markov
chain that is also a martingale.

• A fair ±1 random walk on a cycle. As above, but now the state space
is Z/m, the integers mod m. This is a finite Markov chain. It is also in
some sense a martingale, although we usually don’t define martingales
over finite groups.

• Random walks with absorbing and/or reflecting barriers.

• Random walk on a graph G = (V, E). The state space is V , the


transition probabilities are puv = 1/d(u) if uv ∈ E.
One can also have more general transition probabilities, where the
probability of traversing a particular edge is a property of the edge and
not the degree of its source. In principle we can represent any Markov

chain as a random walk on a graph in this way: the states become vertices,
and the transitions become edges, each labeled with its transition
probability. It’s conventional in this representation to exclude edges
with probability 0 and include self-loops for any transitions i → i.
If the resulting graph is small enough or has a nice structure, this can
be a convenient way to draw a Markov chain.

• The Markov chain given by Xt+1 = Xt + 1 with probability 1/2, and 0


with probability 1/2. The state space is N.

• A finite-state machine running on a random input. The sequence


of states acts as a Markov chain, assuming each input symbol is
independent of the rest.

• A classic randomized algorithm for 2-SAT, due to Papadimitriou [Pap91].


Each state is a truth-assignment. The transitional probabilities are
messy but arise from the following process: pick an unsatisfied clause,
pick one of its two variables uniformly at random, and invert it. Then
there is an absorbing state at any satisfying assignment. With a bit of
work, it can be shown that the Hamming distance between the current
assignment and some satisfying assignment follows a random walk that
is either unbiased or biased toward 0, giving a satisfying assignment
after O(n^2) steps on average.^3 This algorithm is not necessarily all
that good, because there is a clever deterministic algorithm that solves
2-SAT in time linear in the size of the formula [APT79], but it has the
nice property of not being so clever. (A code sketch of this random walk appears after these examples.)

• A similar process works for 2-colorability, 3-SAT, 3-colorability, etc.,


although for NP-hard problems, it may take a while to reach an
absorbing state. The constructive Lovász Local Lemma proof from
§13.3.5 also follows this pattern.
^3 The proof of this is not too hard: given an unsatisfying assignment x and a satisfying assignment y, any clause that is not satisfied by x includes at least one variable that is satisfied by y. If we get lucky and flip this variable, we reduce the distance by one. If not, we increase it by at most 1 (depending on whether y satisfies only one or both variables). So we get a process bounded by an unbiased random walk with a reflecting barrier at n and an absorbing barrier at 0, assuming we don't hit some other satisfying assignment y′ first.
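Here is a minimal Python sketch of the random walk described above. The code and its clause representation are not from the notes: it assumes a clause is a pair of literals, each literal a (variable index, sign) pair, and it caps the number of steps at an arbitrary multiple of the O(n²) expectation.

```python
import random

def rand_2sat(clauses, n, max_steps=None):
    """Random walk for 2-SAT: repeatedly pick an unsatisfied clause,
    pick one of its two variables at random, and invert it.
    Returns a satisfying assignment or None if the step budget runs out."""
    if max_steps is None:
        max_steps = 100 * n * n          # comfortably beyond the O(n^2) expectation
    assign = [random.random() < 0.5 for _ in range(n)]

    def satisfied(clause):
        return any(assign[var] == sign for var, sign in clause)

    for _ in range(max_steps):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign                # absorbing state: all clauses satisfied
        var, _ = random.choice(random.choice(unsat))   # random unsatisfied clause, random variable in it
        assign[var] = not assign[var]    # invert that variable
    return None

if __name__ == "__main__":
    # (x0 or x1) and (not x0 or x1) and (x0 or not x2)
    clauses = [((0, True), (1, True)), ((0, False), (1, True)), ((0, True), (2, False))]
    print(rand_2sat(clauses, 3))
```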

10.2 Convergence of Markov chains


We want to use Markov chains for sampling. Typically this means that we
have some subset S of the state space, and we want to know what proportion
of the states are in S. If we can't sample states from the state space a
priori, we may be able to get a good approximation by running the Markov
chain for a while and hoping that it converges to something predictable.
To show this, we will proceed though several steps:

1. We will define a class of distributions, known as stationary distribu-


tions, that we hope to converge to (§10.2.1).

2. We will define a distance between distributions, the total variation


distance (§10.2.2).

3. We will define the mixing time of a Markov chain as the minimum


time for the distribution of the position of a particle to get within a
given total variation distance of the stationary distribution (§10.2.3).

4. We will describe a technique called coupling that can be used to


bound total variation distance (§10.2.4), in terms of the probability
that two dependent variables X and Y with the distributions we are
looking at are or are not equal to each other.

5. We will define reducible and periodic Markov chains, which have


structural properties that prevent convergence to a unique stationary
distribution (§10.2.5).

6. We will use a coupling between two copies of a Markov chain to


show that any Markov chain that does not have these properties does
converge in total variation distance to a unique stationary distribution
(§10.2.6).

7. Finally, we will show that if we can construct a coupling between two


copies of a particular chain that causes both copies to reach the same
state quickly on average, then we can use this expected coupling time
to bound the mixing time of the chain that we defined previously. This
will give us a practical tool for showing that many Markov chains not
only converge eventually but converge at a predictable rate, so we can
tell when it is safe to stop running the chain and take a sample (§10.4).

Much of this section follows the approach of Chapter 4 of Levin et


al. [LPW09].

10.2.1 Stationary distributions


For a finite Markov chain, there is a transition matrix P in which each row
sums to 1. We can write this fact compactly as P 1 = 1, where 1 is the
all-ones column vector. This means that 1 is a right eigenvector of P with
eigenvalue 1.^4 Because the left eigenvalues of a matrix are equal to the right
eigenvalues, this means that there will be at least one left eigenvector π such
that πP = π, and in fact it is possible to show that there is at least one
such π that represents a probability distribution in that each πi ≥ 0 and
\sum_i π_i = 1. Such a distribution is called a stationary distribution^5 of the Markov chain, and if π is unique, the probability π_i is called the stationary


probability for i.
Every finite Markov chain has at least one stationary distribution, but
it may not be unique. For example, if the transition matrix is the identity
matrix (meaning that the particle never moves), then all distributions are
stationary.
If a Markov chain does have a unique stationary distribution, we can calculate it from the transition matrix, by observing that the equations

πP = π

and

π1 = 1

together give n + 1 equations in n unknowns, which we can solve for π. (We need an extra equation because the stochastic property of P means that the system πP = π has rank at most n − 1.)
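A small numpy sketch of this computation (not from the notes), reusing the arbitrary 4-state matrix from the earlier example:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi together with sum(pi) = 1, by stacking the n
    equations pi (P - I) = 0 with the normalization row and using
    least squares (the system is overdetermined but consistent)."""
    n = P.shape[0]
    A = np.vstack([(P - np.eye(n)).T, np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.0, 0.7, 0.0],
    [0.0, 0.4, 0.0, 0.6],
    [0.0, 0.0, 0.2, 0.8],
])
pi = stationary_distribution(P)
print(pi)          # the stationary distribution
print(pi @ P)      # should equal pi
```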
Often this will be impractical, especially if our state space is large enough
or messy enough that we can’t write down the entire matrix. In these cases
we may be able to take advantage of a special property of some Markov
chains, called reversibility. We’ll discuss this in §10.3. For the moment
we will content ourselves with showing that we do in fact converge to some
unique stationary distribution if our Markov chain has the right properties.

10.2.2 Total variation distance


Definition 10.2.1. Let X and Y be random variables defined on the same
probability space. Then the total variation distance between X and Y ,
^4 Given a square matrix A, a vector x is a right eigenvector of A with eigenvalue λ if Ax = λx. Similarly, a vector y is a left eigenvector of A with eigenvalue λ if yA = λy.
^5 Spelling is important here. A stationery distribution would involve handing out office supplies.

written d_TV(X, Y), is given by

d_TV(X, Y) = sup_A (Pr[X ∈ A] − Pr[Y ∈ A]), (10.2.1)

where the supremum is taken over all sets A for which Pr[X ∈ A] and Pr[Y ∈ A] are both defined.^6
An equivalent definition is

d_TV(X, Y) = sup_A |Pr[X ∈ A] − Pr[Y ∈ A]|.

The reason this is equivalent is that if Pr [X ∈ A] − Pr [Y ∈ A] is negative,


we can replace A by its complement.
Less formally, given any test set A, X and Y show up in A with proba-
bilities that differ by at most dT V (X, Y ). This is usually what we want for
sampling, since this says that if we are testing some property (represented by
A) of the states we are sampling, the answer (Pr [X ∈ A]) that we get for how
likely this property is to occur is close to the correct answer (Pr [Y ∈ A]).
Total variation distance is a property of distributions, and is not affected
by dependence between X and Y . For finite Markov chains, we can define
the total variation distance between two distributions x and y as
d_TV(x, y) = max_A \sum_{i∈A} (x_i − y_i) = max_A |\sum_{i∈A} (x_i − y_i)|.

A useful fact is that d_TV(x, y) is directly connected to the ℓ_1 distance between x and y. If we let B = {i | x_i ≥ y_i}, then

d_TV(x, y) = max_A \sum_{i∈A} (x_i − y_i) ≤ \sum_{i∈B} (x_i − y_i),

because if A leaves out an element of B, or includes an element of the complement B̄, this can only reduce the sum. But if we consider B̄ instead, we get

d_TV(y, x) = max_A \sum_{i∈A} (y_i − x_i) ≤ \sum_{i∈B̄} (y_i − x_i).
^6 For discrete random variables, this just means all A, since we can write Pr[X ∈ A] as \sum_{x∈A} Pr[X = x]; we can also replace sup with max for this case. For continuous random variables, we want that X^{−1}(A) and Y^{−1}(A) are both measurable. If our X and Y range over the states of a countable Markov chain, we will be working with discrete random variables, so we can just consider all A.

Now observe that

‖x − y‖_1 = \sum_i |x_i − y_i|
          = \sum_{i∈B} (x_i − y_i) + \sum_{i∈B̄} (y_i − x_i)
          = d_TV(x, y) + d_TV(y, x)
          = 2 d_TV(x, y).

So d_TV(x, y) = (1/2)‖x − y‖_1.
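As a quick illustration (not from the notes), the following Python sketch computes d_TV via the ℓ_1 formula and checks it against the max-over-sets definition by brute force on a tiny, made-up example:

```python
import numpy as np
from itertools import combinations

def d_tv(x, y):
    """Total variation distance between two discrete distributions,
    via d_TV(x, y) = (1/2) * ||x - y||_1."""
    x, y = np.asarray(x), np.asarray(y)
    return 0.5 * np.abs(x - y).sum()

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.4, 0.4, 0.2])
print(d_tv(x, y))      # 0.1

# Brute-force check against the max-over-sets definition (all 2^n subsets A):
n = len(x)
subsets = (s for r in range(n + 1) for s in combinations(range(n), r))
print(max(sum(x[i] - y[i] for i in s) for s in subsets))   # also 0.1
```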

10.2.2.1 Total variation distance and expectation


Sometimes it’s useful to translate a bound on total variation to a bound on
the error when getting the expectation of a random variable. The following
lemma may be handy:

Lemma 10.2.2. Let x and y be two distributions of some discrete random


variable Z. Let Ex (Z) and Ey (Z) be the expectations of Z with respect to
each of these distributions. Suppose that |Z| ≤ M always. Then

|Ex (Z) − Ey (Z)| ≤ 2M · dT V (x, y). (10.2.2)

Proof. Compute

|E_x(Z) − E_y(Z)| = |\sum_z z (Pr_x(Z = z) − Pr_y(Z = z))|
                  ≤ \sum_z |z| · |Pr_x(Z = z) − Pr_y(Z = z)|
                  ≤ M \sum_z |Pr_x(Z = z) − Pr_y(Z = z)|
                  ≤ M ‖x − y‖_1
                  = 2M · d_TV(x, y).

10.2.3 Mixing time


We are going to show that well-behaved finite Markov chains eventually
converge to some stationary distribution π in total variation distance. This

means that for any ε > 0, there is a mixing time t_mix(ε) such that for any initial distribution x and any t ≥ t_mix(ε),

d_TV(xP^t, π) ≤ ε.

It is common to standardize ε as 1/4: if we write just t_mix, this means t_mix(1/4). The choice of 1/4 is somewhat arbitrary, but has some nice technical properties that we will see below.

10.2.4 Coupling of Markov chains


A coupling of two random variables X and Y is a joint distribution on
⟨X, Y⟩ that gives the correct marginal distribution for each of X and Y while creating a dependence between them with some desirable property (for example, minimizing total variation distance or maximizing Pr[X = Y]).
We will use couplings between Markov chains to prove convergence. Here we take two copies of the chain, one of which starts in an arbitrary
distribution, and one of which starts in the stationary distribution, and show
that we can force them to converge to each other by carefully correlating their
transitions. Since the second chain is always in the stationary distribution,
this will show that the first chain converges to the stationary distribution as
well.
The tool that makes this work is the Coupling Lemma:7

Lemma 10.2.3. For any discrete random variables X and Y ,

d_TV(X, Y) ≤ Pr[X ≠ Y].
^7 It turns out that the bound in the Coupling Lemma is tight in the following sense: for any given distributions on X and Y, there exists a joint distribution giving these distributions such that d_TV(X, Y) is exactly equal to Pr[X ≠ Y] when X and Y are sampled from the joint distribution. For discrete distributions, the easiest way to construct the joint distribution is first to let Y = X = i for each i with probability min(Pr[X = i], Pr[Y = i]), and then distribute the remaining probability for X over all the cases where Pr[X = i] > Pr[Y = i] and similarly for Y over all the cases where Pr[Y = i] > Pr[X = i]. Looking at the unmatched values for X gives Pr[X ≠ Y] ≤ \sum_{i : Pr[X=i] > Pr[Y=i]} (Pr[X = i] − Pr[Y = i]) ≤ d_TV(X, Y). So in this case Pr[X ≠ Y] = d_TV(X, Y).
Unfortunately, the fact that there always exists a perfect coupling in this sense does not mean that we can express it in any convenient way, or that even if we could, it would arise from the kind of causal, step-by-step construction that we will use for couplings between Markov processes.

Proof. Let A be any set for which Pr [X ∈ A] and Pr [Y ∈ A] are defined.


Then

Pr[X ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∈ A ∧ Y ∉ A],
Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∈ A] + Pr[X ∉ A ∧ Y ∈ A],

and thus

Pr[X ∈ A] − Pr[Y ∈ A] = Pr[X ∈ A ∧ Y ∉ A] − Pr[X ∉ A ∧ Y ∈ A]
                      ≤ Pr[X ∈ A ∧ Y ∉ A]
                      ≤ Pr[X ≠ Y].

Since this holds for any particular set A, it also holds when we take the
maximum over all A to get dT V (X, Y ).

For Markov chains, our goal will be to find a useful coupling between a se-
quence of random variables X0 , X1 , X2 , . . . corresponding to the Markov chain
starting in an arbitrary distribution with a second sequence Y0 , Y1 , Y2 , . . .
corresponding to the same chain starting in a stationary distribution. What
will make a coupling useful is if Pr[X_t ≠ Y_t] is small for reasonably large t:
since Yt has the stationary distribution, this will show that dT V (xP t , π) is
also small.
Our first use of this technique will be to show, using a rather generic
coupling, that Markov chains with certain nice properties converge to their
stationary distribution in the limit. Later we will construct specialized
couplings for particular Markov chains to show that they converge quickly.
But first we will consider what properties a Markov chain must have to
converge at all.

10.2.5 Irreducible and aperiodic chains


Not all chains are guaranteed to converge to their stationary distribution.
If some states are not reachable from other states, it may be that starting
in one part of the chain will keep us from ever reaching another part of the
chain. Even if the chain is not disconnected in this way, we might still not
converge if the distribution oscillates back and forth due to some periodicity
in the structure of the chain. But if these conditions do not occur, then we
will be able to show convergence.
Let p^t_{ij} be Pr[X_t = j | X_0 = i].
A Markov chain is irreducible if, for all states i and j, there exists some t such that p^t_{ij} ≠ 0. This says that we can reach any state from any other

state if we wait long enough. If we think of a directed graph of the Markov


chain where the states are vertices and each edge represents a transition that
occurs with nonzero probability, the Markov chain is irreducible if its graph
is strongly connected.
The period of a state i of a Markov chain is gcd{t > 0 : p^t_{ii} ≠ 0}. If the period of i is m, then starting from i we can only return to i at times that
are multiples of m. If m = 1, state i is said to be aperiodic. A Markov chain
as a whole is aperiodic if all of its states are aperiodic. In graph-theoretic
terms, this means that the graph of the chain is not k-partite for any k > 1.
Reversible chains are also an interesting special case: if a chain is reversible,
it can’t have a period greater than 2, since we can always step off a node
and step back.
If our Markov chain is not aperiodic, we can make it aperiodic by flipping a coin at each step to decide whether to move or not. This gives a lazy Markov chain whose transition probabilities are given by (1/2)p_{ij} when i ≠ j and 1/2 + (1/2)p_{ij} when i = j. This doesn't affect the stationary distribution: if we replace our transition matrix P with a new transition matrix (P + I)/2, and πP = π, then π(P + I)/2 = (1/2)πP + (1/2)πI = (1/2)π + (1/2)π = π.
Unfortunately there is no quick fix for reducible Markov chains. But
since we will often be designing the Markov chains we will be working with,
we can just take care to make sure they are not reducible.
We will later need the following lemma about aperiodic Markov chains,
which is related to the Frobenius problem of finding the minimum value
that cannot be constructed using coins of given denominations:

Lemma 10.2.4. Let i be an aperiodic state of some Markov chain. Then


there is a time t_0 such that p^t_{ii} ≠ 0 for all t ≥ t_0.

Proof. Let S = {t | p^t_{ii} ≠ 0}. Since gcd(S) = 1, there is a finite subset S′ of S such that gcd S′ = 1. Write the elements of S′ as m_1, m_2, . . . , m_k and let M = \prod_{j=1}^{k} m_j. From the extended Euclidean algorithm, there exist integer coefficients a_j with |a_j| ≤ M/m_j such that \sum_{j=1}^{k} a_j m_j = 1. We would like to use each a_j as the number of times to go around the length-m_j loop from i to i. Unfortunately many of these a_j will be negative.
To solve this problem, we replace a_j with b_j = a_j + M/m_j. This makes all the coefficients non-negative, and gives \sum_{j=1}^{k} b_j m_j = kM + 1. This implies that there is a sequence of loops that gets us from i back to i in kM + 1 steps, or in other words that p^{kM+1}_{ii} ≠ 0. By repeating this sequence ℓ times, we can similarly show that p^{ℓkM+ℓ}_{ii} ≠ 0 for any ℓ.
We can also pad any of these sequences out by as many copies of M as we like. In particular, given ℓkM + ℓ, where ℓ ∈ {0, . . . , M − 1}, we can add (kM − ℓ)kM to it (a multiple of M) to get (kM)^2 + ℓ. This means that we can express any t ∈ {(kM)^2, . . . , (kM)^2 + M − 1} as a sum of elements of S, or equivalently that p^t_{ii} ≠ 0 for any such t. But for larger t, we can just add in more copies of M. So in fact p^t_{ii} ≠ 0 for all t ≥ t_0 = (kM)^2.

10.2.6 Convergence of finite irreducible aperiodic Markov chains
We can now show:

Theorem 10.2.5. Any finite irreducible aperiodic Markov chain converges


to a unique stationary distribution in the limit.

Proof. Consider two copies of the chain {X_t} and {Y_t}, where X_0 starts in some arbitrary distribution x and Y_0 starts in a stationary distribution π. Define a coupling between {X_t} and {Y_t} by the rule: (a) if X_t ≠ Y_t, then Pr[X_{t+1} = j ∧ Y_{t+1} = j′ | X_t = i ∧ Y_t = i′] = p_{ij} p_{i′j′}; and (b) if X_t = Y_t, then Pr[X_{t+1} = Y_{t+1} = j | X_t = Y_t = i] = p_{ij}. Intuitively, we let both chains run independently until they collide, after which we run them together. Since each chain individually moves from state i to state j with probability p_{ij} in either case, we have that X_t evolves normally and Y_t remains in the stationary distribution.
Now let us show that d_TV(xP^t, π) ≤ Pr[X_t ≠ Y_t] goes to zero in the limit. Pick some state i. Let r be the maximum over all states j of the first passage time f_{ji}, where f_{ji} is the minimum time t such that p^t_{ji} ≠ 0. Let s be a time such that p^t_{ii} ≠ 0 for all t ≥ s (the existence of such an s is given by Lemma 10.2.4).
Suppose that at time ℓ(r + s), where ℓ ∈ N, X_{ℓ(r+s)} = j ≠ j′ = Y_{ℓ(r+s)}. Then there are times ℓ(r + s) + u and ℓ(r + s) + u′, where u, u′ ≤ r, such that X reaches i at time ℓ(r + s) + u and Y reaches i at time ℓ(r + s) + u′ with nonzero probability. Since r + s − u ≥ s, then having reached i at these times, X and Y both return to i at time ℓ(r + s) + (r + s) = (ℓ + 1)(r + s) with nonzero probability. Let ε > 0 be the product of these nonzero probabilities; then Pr[X_{(ℓ+1)(r+s)} ≠ Y_{(ℓ+1)(r+s)}] ≤ (1 − ε) Pr[X_{ℓ(r+s)} ≠ Y_{ℓ(r+s)}], and in general we have Pr[X_t ≠ Y_t] ≤ (1 − ε)^{⌊t/(r+s)⌋}, which goes to zero in the limit. This implies that d_TV(xP^t, π) also goes to zero in the limit (using the Coupling Lemma), and since any initial distribution (including a stationary distribution) converges to π, π is the unique stationary distribution as claimed.

This argument requires that the chain be finite, because otherwise we


cannot take the maximum over all the first passage times. For infinite Markov
chains, it is not always enough to be irreducible and aperiodic to converge
to a stationary distribution (or even to have a stationary distribution at all).
However, with some additional conditions a similar result can be shown: see
for example [GS01, §6.4].

10.3 Reversible chains


A Markov chain with transition probabilities pij is reversible if there is a
distribution π such that, for all i and j,

πi pij = πj pji . (10.3.1)

These are called the detailed balance equations: they say that in the stationary distribution, the probability of seeing a transition from i to j is equal to the probability of seeing a transition from j to i. If this is the case, then \sum_i π_i p_{ij} = \sum_i π_j p_{ji} = π_j, which means that π is stationary.
It’s worth noting that this works for countable chains even if they are
not finite, because the sums always converge since each term is non-negative
and \sum_i π_i p_{ij} is dominated by \sum_i π_i = 1. However, it may not be the case
for any particular p that there exists a corresponding stationary distribution
π. If this happens, the chain is not reversible.

10.3.1 Stationary distributions


The detailed balance equations often give a very quick way to compute the stationary distribution, since if we know π_i, and p_{ij} ≠ 0, then π_j = π_i p_{ij}/p_{ji}. If the transition probabilities are reasonably well-behaved (for example, if p_{ij} = p_{ji} for all i, j), we may even be able to characterize the stationary distribution up to a constant multiple even if we have no way to efficiently enumerate all the states of the process.
What (10.3.1) says is that if we start in the stationary distribution and observe either a forward transition ⟨X_0, X_1⟩ or a backward transition ⟨X_1, X_0⟩, we can't tell which is which; Pr[X_0 = i ∧ X_1 = j] = Pr[X_1 = i ∧ X_0 = j]. This extends to longer sequences. The probability that X_0 = i, X_1 = j, and X_2 = k is given by π_i p_{ij} p_{jk} = p_{ji} π_j p_{jk} = p_{ji} p_{kj} π_k, which is the probability that X_0 = k, X_1 = j, and X_2 = i. (A similar argument works for finite sequences of any length.) So a reversible Markov chain is one with no arrow of time in the stationary distribution.

A typical reversible Markov chain is a random walk on a graph, where a step starting from a vertex u goes to one of its neighbors v, with each neighbor chosen with probability 1/d(u). This has a stationary distribution

π_u = d(u) / \sum_v d(v) = d(u) / (2|E|),

which satisfies

π_u p_{uv} = (d(u)/(2|E|)) · (1/d(u)) = (d(v)/(2|E|)) · (1/d(v)) = π_v p_{vu}.
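A quick numerical check of these two facts (not from the notes) for a small, arbitrarily chosen graph:

```python
import numpy as np

# Random walk on a small undirected graph: check that pi_u = d(u)/(2|E|)
# is stationary and satisfies detailed balance.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

deg = A.sum(axis=1)
P = A / deg[:, None]             # p_{uv} = 1/d(u) for each neighbor v of u
pi = deg / (2 * len(edges))      # pi_u = d(u) / (2|E|)

flow = pi[:, None] * P           # flow[u, v] = pi_u * p_{uv}
print(np.allclose(pi @ P, pi))   # stationarity: pi P = pi
print(np.allclose(flow, flow.T)) # detailed balance: pi_u p_{uv} = pi_v p_{vu}
```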
If we don’t know π in advance, we can often guess it by observing that
πi pij = πj pji
implies
pij
πj = πi , (10.3.2)
pji
provided pij 6= 0. This gives us the ability to calculate πk starting from any
initial state i as long as there is some chain of transitions i = i0 → i1 → i2 →
. . . i` = k where each step im → im+1 has pim ,im+1 6= 0. For a random walk
on a graph, this implies that π is unique as long as the graph is connected.
This of course only works for