
Lecture Notes on Statistics and Information Theory

John Duchi

December 6, 2023
Contents

1 Introduction and setting 8


1.1 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Moving to statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 A remark about measure theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Outline and chapter discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 An information theory review 12


2.1 Basics of Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 Chain rules and related properties . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Data processing inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 General divergence measures and definitions . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Partitions, algebras, and quantizers . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 KL-divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.3 f -divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 Inequalities and relationships between divergences . . . . . . . . . . . . . . . 25
2.2.5 Convexity and data processing for divergence measures . . . . . . . . . . . . 29
2.3 First steps into optimal procedures: testing inequalities . . . . . . . . . . . . . . . . 30
2.3.1 Le Cam’s inequality and binary hypothesis testing . . . . . . . . . . . . . . . 30
2.3.2 Fano’s inequality and multiple hypothesis testing . . . . . . . . . . . . . . . . 31
2.4 A first operational result: entropy and source coding . . . . . . . . . . . . . . . . . . 33
2.4.1 The source coding problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 The Kraft-McMillan inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.3 Entropy rates and longer codes . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3 Exponential families and statistical modeling 45


3.1 Exponential family models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Why exponential families? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.1 Fitting an exponential family model . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Divergence measures and information for exponential families . . . . . . . . . . . . . 51
3.4 Generalized linear models and regression . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Fitting a generalized linear model from a sample . . . . . . . . . . . . . . . . 55
3.5 Lower bounds on testing a parameter’s value . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


3.6.1 Proof of Proposition 3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


3.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

I Concentration, information, stability, and generalization 61

4 Concentration Inequalities 62
4.1 Basic tail inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 Sub-Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.2 Sub-exponential random variables . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.3 Orlicz norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.4 First applications of concentration: random projections . . . . . . . . . . . . 73
4.1.5 A second application of concentration: codebook generation . . . . . . . . . . 75
4.2 Martingale methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities . . . . . . . . . 78
4.2.2 Examples and bounded differences . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Uniformity and metric entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 Symmetrization and uniform laws . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.2 Metric entropy, coverings, and packings . . . . . . . . . . . . . . . . . . . . . 86
4.4 Generalization bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.1 Finite and countable classes of functions . . . . . . . . . . . . . . . . . . . . . 91
4.4.2 Large classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.3 Structural risk minimization and adaptivity . . . . . . . . . . . . . . . . . . . 95
4.5 Technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.1 Proof of Theorem 4.1.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5.2 Proof of Theorem 4.1.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.5.3 Proof of Theorem 4.3.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5 Generalization and stability 104


5.1 The variational representation of Kullback-Leibler divergence . . . . . . . . . . . . . 105
5.2 PAC-Bayes bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2.1 Relative bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2.2 A large-margin guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2.3 A mutual information bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3 Interactive data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3.1 The interactive setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.2 Second moment errors and mutual information . . . . . . . . . . . . . . . . . 116
5.3.3 Limiting interaction in interactive analyses . . . . . . . . . . . . . . . . . . . 117
5.3.4 Error bounds for a simple noise addition scheme . . . . . . . . . . . . . . . . 122
5.4 Bibliography and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124


6 Advanced techniques in concentration inequalities 128


6.1 Entropy and concentration inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.1.1 The Herbst argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.1.2 Tensorizing the entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1.3 Concentration of convex functions . . . . . . . . . . . . . . . . . . . . . . . . 134

7 Privacy and disclosure limitation 138


7.1 Disclosure limitation, privacy, and definitions . . . . . . . . . . . . . . . . . . . . . . 138
7.1.1 Basic mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.1.2 Resilience to side information, Bayesian perspectives, and data processing . . 144
7.2 Weakenings of differential privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.1 Basic mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.2 Connections between privacy measures . . . . . . . . . . . . . . . . . . . . . . 149
7.2.3 Side information protections under weakened notions of privacy . . . . . . . . 152
7.3 Composition and privacy based on divergence . . . . . . . . . . . . . . . . . . . . . . 155
7.3.1 Composition of Rényi-private channels . . . . . . . . . . . . . . . . . . . . . . 155
7.3.2 Privacy games and composition . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.4 Additional mechanisms and privacy-preserving algorithms . . . . . . . . . . . . . . . 158
7.4.1 The exponential mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.4.2 Local sensitivities and the inverse sensitivity mechanism . . . . . . . . . . . . 161
7.5 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.5.1 Proof of Lemma 7.2.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

II Fundamental limits and optimality 176

8 Minimax lower bounds: the Le Cam, Fano, and Assouad methods 178
8.1 Basic framework and minimax risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.2 Preliminaries on methods for lower bounds . . . . . . . . . . . . . . . . . . . . . . . 180
8.2.1 From estimation to testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8.2.2 Inequalities between divergences and product distributions . . . . . . . . . . 182
8.2.3 Metric entropy and packing numbers . . . . . . . . . . . . . . . . . . . . . . . 184
8.3 Le Cam’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.4 Fano’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.4.1 The classical (local) Fano method . . . . . . . . . . . . . . . . . . . . . . . . 187
8.4.2 A distance-based Fano method . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.5 Assouad’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.5.1 Well-separated problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.5.2 From estimation to multiple binary tests . . . . . . . . . . . . . . . . . . . . . 195
8.5.3 Example applications of Assouad’s method . . . . . . . . . . . . . . . . . . . 197
8.6 Nonparametric regression: minimax upper and lower bounds . . . . . . . . . . . . . 199
8.6.1 Kernel estimates of the function . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.6.2 Minimax lower bounds on estimation with Assouad’s method . . . . . . . . . 203
8.7 Global Fano Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
8.7.1 A mutual information bound based on metric entropy . . . . . . . . . . . . . 206


8.7.2 Minimax bounds using global packings . . . . . . . . . . . . . . . . . . . . . . 208


8.7.3 Example: non-parametric regression . . . . . . . . . . . . . . . . . . . . . . . 208
8.8 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.8.1 Proof of Proposition 8.4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.8.2 Proof of Corollary 8.4.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
8.8.3 Proof of Lemma 8.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.9 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

9 Constrained risk inequalities 220


9.1 Strong data processing inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.2 Local privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.3 Communication complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.3.1 Classical communication complexity problems . . . . . . . . . . . . . . . . . . 227
9.3.2 Deterministic communication: lower bounds and structure . . . . . . . . . . . 230
9.3.3 Randomization, information complexity, and direct sums . . . . . . . . . . . 232
9.3.4 The structure of randomized communication and communication complexity
of primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9.4 Communication complexity in estimation . . . . . . . . . . . . . . . . . . . . . . . . 239
9.4.1 Direct sum communication bounds . . . . . . . . . . . . . . . . . . . . . . . . 240
9.4.2 Communication data processing . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.4.3 Applications: communication and privacy lower bounds . . . . . . . . . . . . 243
9.5 Proof of Theorem 9.4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9.5.1 Proof of Lemma 9.5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
9.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

10 Testing and functional estimation 257


10.1 Le Cam’s convex hull method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.1.1 The χ2 -mixture bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.1.2 Estimating errors and the norm of a Gaussian vector . . . . . . . . . . . . . . 261
10.2 Minimax hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.2.1 Detecting a difference in populations . . . . . . . . . . . . . . . . . . . . . . . 264
10.2.2 Signal detection and testing a Gaussian mean . . . . . . . . . . . . . . . . . . 265
10.2.3 Goodness of fit and two-sample tests for multinomials . . . . . . . . . . . . . 267
10.3 Geometrizing rates of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.4 Best possible lower bounds and super-efficiency . . . . . . . . . . . . . . . . . . . . . 271
10.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.6 A useful divergence calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

III Entropy, predictions, divergences, and information 277

11 Predictions, loss functions, and entropies 278


11.1 Proper losses, scoring rules, and generalized entropies . . . . . . . . . . . . . . . . . 279
11.1.1 A convexity primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280


11.1.2 From a proper loss to an entropy . . . . . . . . . . . . . . . . . . . . . . . . . 282


11.1.3 The information in an experiment . . . . . . . . . . . . . . . . . . . . . . . . 283
11.2 Characterizing proper losses and Bregman divergences . . . . . . . . . . . . . . . . . 285
11.2.1 Characterizing proper losses for Y taking finitely many values . . . . . . . . . 285
11.2.2 General proper losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
11.2.3 Proper losses and vector-valued Y . . . . . . . . . . . . . . . . . . . . . . . . 291
11.3 From entropies to convex losses, arbitrary predictions, and link functions . . . . . . . 294
11.3.1 Convex conjugate linkages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
11.3.2 Convex conjugate linkages with affine constraints . . . . . . . . . . . . . . . . 298
11.4 Exponential families, maximum entropy, and log loss . . . . . . . . . . . . . . . . . . 301
11.4.1 Maximizing entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
11.4.2 I-projections and maximum likelihood . . . . . . . . . . . . . . . . . . . . . . 307
11.5 Technical and deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
11.5.1 Finalizing the proof of Theorem 11.2.14 . . . . . . . . . . . . . . . . . . . . . 308
11.5.2 Proof of Proposition 11.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
11.5.3 Proof of Proposition 11.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
11.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

12 Calibration and Proper Losses 315


12.1 Proper losses and calibration error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
12.2 Measuring calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
12.2.1 The impossibility of measuring calibration . . . . . . . . . . . . . . . . . . . . 319
12.2.2 Alternative calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . 322
12.3 Auditing and improving calibration at the population level . . . . . . . . . . . . . . 325
12.3.1 The post-processing gap and calibration audits for squared error . . . . . . . 325
12.3.2 Calibration audits for losses based on conjugate linkages . . . . . . . . . . . . 327
12.3.3 A population-level algorithm for calibration . . . . . . . . . . . . . . . . . . . 329
12.4 Calibeating: improving squared error by calibration . . . . . . . . . . . . . . . . . . 330
12.4.1 Proof of Theorem 12.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
12.5 Continuous and equivalent calibration measures . . . . . . . . . . . . . . . . . . . . . 336
12.5.1 Calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
12.5.2 Equivalent calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . . 339
12.6 Deferred technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
12.6.1 Proof of Lemma 12.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
12.6.2 Proof of Proposition 12.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
12.6.3 Proof of Lemma 12.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
12.6.4 Proof of Theorem 12.5.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
12.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
12.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351

13 Surrogate Risk Consistency: the Classification Case 352


13.1 General results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
13.2 Proofs of convex analytic results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
13.2.1 Proof of Lemma 13.0.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
13.2.2 Proof of Lemma 13.0.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
13.2.3 Proof of Lemma 13.0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
13.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359


14 Divergences, classification, and risk 362


14.1 Generalized entropies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
14.2 From entropy to losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
14.2.1 Classification case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
14.2.2 Structured prediction case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
14.3 Predictions, calibration, and scoring rules . . . . . . . . . . . . . . . . . . . . . . . . 369
14.4 Surrogate risk consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
14.4.1 Uniformly convex case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
14.4.2 Structured prediction (discrete) case . . . . . . . . . . . . . . . . . . . . . . . 369
14.4.3 Proof of Theorem 14.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
14.5 Loss equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
14.6 Proof of Theorem 14.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
14.6.1 Proof of Lemma 14.6.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
14.6.2 Proof of Lemma 14.6.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
14.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
14.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376

15 Fisher Information 378


15.1 Fisher information: definitions and examples . . . . . . . . . . . . . . . . . . . . . . 378
15.2 Estimation and Fisher information: elementary considerations . . . . . . . . . . . . . 380
15.3 Connections between Fisher information and divergence measures . . . . . . . . . . . 381

IV Online game playing and compression 384

16 Universal prediction and coding 385


16.1 Basics of minimax game playing with log loss . . . . . . . . . . . . . . . . . . . . . . 385
16.2 Universal and sequential prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
16.3 Minimax strategies for regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.4 Mixture (Bayesian) strategies and redundancy . . . . . . . . . . . . . . . . . . . . . . 391
16.4.1 Bayesian redundancy and objective, reference, and Jeffreys priors . . . . . . . 394
16.4.2 Redundancy capacity duality . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
16.5 Asymptotic normality and Theorem 16.4.1 . . . . . . . . . . . . . . . . . . . . . . . . 396
16.5.1 Heuristic justification of asymptotic normality . . . . . . . . . . . . . . . . . 397
16.5.2 Heuristic calculations of posterior distributions and redundancy . . . . . . . . 397
16.6 Proof of Theorem 16.4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
16.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400

17 Universal prediction with other losses 403


17.1 Redundancy and expected regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
17.1.1 Universal prediction via the log loss . . . . . . . . . . . . . . . . . . . . . . . 404
17.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
17.2 Individual sequence prediction and regret . . . . . . . . . . . . . . . . . . . . . . . . 408


18 Online convex optimization 413


18.1 The problem of online convex optimization . . . . . . . . . . . . . . . . . . . . . . . 413
18.2 Online gradient and non-Euclidean gradient (mirror) descent . . . . . . . . . . . . . 415
18.2.1 Proof of Theorem 18.2.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
18.3 Online to batch conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
18.4 More refined convergence guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
18.4.1 Proof of Proposition 18.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422

19 Exploration, exploitation, and bandit problems 424


19.1 Confidence-based algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
19.2 Bayesian approaches to bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
19.2.1 Posterior (Thompson) sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 430
19.2.2 An information-theoretic analysis . . . . . . . . . . . . . . . . . . . . . . . . . 433
19.2.3 Information and exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
19.3 Online gradient descent approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
19.4 Further notes and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
19.5 Technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
19.5.1 Proof of Claim (19.1.1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435

V Appendices 437

A Miscellaneous mathematical results 438


A.1 The roots of a polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
A.2 Measure-theoretic development of divergence measures . . . . . . . . . . . . . . . . . 438

B Convex Analysis 439


B.1 Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
B.1.1 Operations preserving convexity . . . . . . . . . . . . . . . . . . . . . . . . . 441
B.1.2 Representation and separation of convex sets . . . . . . . . . . . . . . . . . . 443
B.2 Sublinear and support functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
B.3 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
B.3.1 Equivalent definitions of convex functions . . . . . . . . . . . . . . . . . . . . 450
B.3.2 Continuity properties of convex functions . . . . . . . . . . . . . . . . . . . . 452
B.3.3 Operations preserving convexity . . . . . . . . . . . . . . . . . . . . . . . . . 458
B.3.4 Smoothness properties, first-order developments for convex functions, and
subdifferentiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
B.3.5 Calculus rules of subgradients . . . . . . . . . . . . . . . . . . . . . . . . . . . 465

C Optimality, stability, and duality 468


C.1 Optimality conditions and stability properties . . . . . . . . . . . . . . . . . . . . . . 469
C.1.1 Subgradient characterizations for optimality . . . . . . . . . . . . . . . . . . . 469
C.1.2 Stability properties of minimizers . . . . . . . . . . . . . . . . . . . . . . . . . 471
C.2 Conjugacy and duality properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
C.2.1 Gradient dualities and the Fenchel-Young inequality . . . . . . . . . . . . . . 476
C.2.2 Smoothness and strict convexity of conjugates . . . . . . . . . . . . . . . . . . 478
C.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483

Chapter 1

Introduction and setting

This set of lecture notes explores some of the (many) connections relating information theory,
statistics, computation, and learning. Signal processing, machine learning, and statistics all revolve
around extracting useful information from signals and data. In signal processing and information
theory, a central question is how to best design signals—and the channels over which they are
transmitted—to maximally communicate and store information, and to allow the most effective
decoding. In machine learning and statistics, by contrast, it is often the case that there is a
fixed data distribution that nature provides, and it is the learner’s or statistician’s goal to recover
information about this (unknown) distribution.
A central aspect of information theory is the discovery of fundamental results: results that
demonstrate that certain procedures are optimal. That is, information theoretic tools allow a
characterization of the attainable results in a variety of communication and statistical settings. As
we explore in these notes in the context of statistical, inferential, and machine learning tasks, this
allows us to develop procedures whose optimality we can certify—no better procedure is possible.
Such results are useful for a myriad of reasons; we would like to avoid making bad decisions or false
inferences, we may realize a task is impossible, and we can explicitly calculate the amount of data
necessary for solving different statistical problems.

1.1 Information theory


Information theory is a broad field, but focuses on several main questions: what is information,
how much information content do various signals and data hold, and how much information can be
reliably transmitted over a channel. We will vastly oversimplify information theory into two main
questions with corresponding chains of tasks.

1. How much information does a signal contain?

2. How much information can a noisy channel reliably transmit?

In this context, we provide two main high-level examples, one for each of these tasks.

Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have

Source → Compressor → Decompressor → Receiver


The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
use the fewest bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This minimal number of bits is then the information content of the source (signal).

Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. In this case, the graphical representation is

Source → Compressor → Channel → Decompressor → Receiver

Here the question is the maximum number of bits that may be sent per channel use in
the sense that the receiver may reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy and number of bits that must be sent to allow such
reconstruction.

1.2 Moving to statistics


Statistics and machine learning can be studied with the same views in mind: broadly, each can
be thought of as (perhaps shoehorned into) a source coding or a channel coding problem.
In the analogy with source coding, we observe a sequence of data points X1 , . . . , Xn drawn from
some (unknown) distribution P on a space X . For example, we might be observing species that
biologists collect. Then the analogue of source coding is to construct a model (often a generative
model) that encodes the data using relatively few bits: that is,
Source (P ) → X1 , . . . , Xn → Compressor → P̂ → Decompressor → Receiver.

Here, we estimate P̂ —an empirical version of the distribution P that is easier to describe than
the original signal X1 , . . . , Xn , with the hope that we learn information about the generating
distribution P , or at least describe it efficiently.
In our analogy with channel coding, we make a connection with estimation and inference.
Roughly, the major problem in statistics we consider is as follows: there exists some unknown
function f on a space X that we wish to estimate, and we are able to observe a noisy version
of f (Xi ) for a series of Xi drawn from a distribution P . Recalling the graphical description of
Example 1.1.2, we now have a channel P (Y | f (X)) that gives us noisy observations of f (X) for
each Xi , but we may (generally) no longer choose the encoder/compressor. That is, we have
Source (P ) → X1 , . . . , Xn → Compressor → f (X1 ), . . . , f (Xn ) → Channel P (Y | f (X)) → Y1 , . . . , Yn → Decompressor.

The estimation—decompression—problem is to either estimate f , or, in some cases, to estimate
other aspects of the source probability distribution P . In general, in statistics, we do not have
any choice in the design of the compressor f that transforms the original signal X1 , . . . , Xn , which
makes it somewhat different from traditional ideas in information theory. In some cases that we
explore later—such as experimental design, randomized controlled trials, reinforcement learning
and bandits (and associated exploration/exploitation tradeoffs)—we are also able to influence the
compression part of the above scheme.


Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data Xi belong to Rd , and the compression function is f (x) = θ^⊤ x
for some vector θ ∈ Rd . Then the channel is often of the form

Yi = θ^⊤ Xi + εi ,

where θ^⊤ Xi is the signal, εi is the noise, and the εi ∼iid N(0, σ 2 ) are independent mean-zero
normal perturbations. The goal is, given a
sequence of pairs (Xi , Yi ), to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental design,
we may choose the sequence Xi so as to better explore properties of θ. Later in the course we
will investigate whether it is possible to improve estimation by these strategies. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting ‖Xi‖ → ∞—
choosing very “large” vectors xi —then the signal θ^⊤ Xi should swamp any noise and make
estimation easier.
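To make this channel view concrete, here is a minimal simulation sketch in Python with NumPy (the dimension, sample size, and noise level are illustrative choices, not from the text): it generates pairs (Xi , Yi ) through the channel Yi = θ^⊤ Xi + εi , recovers θ by least squares, and shows how scaling up ‖Xi‖ shrinks the estimation error, matching the infinite-power intuition above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 5, 1000, 1.0
theta = rng.normal(size=d)  # the unknown parameter we wish to recover

for scale in [1.0, 10.0]:  # larger scale = more "signal power" in X_i
    X = scale * rng.normal(size=(n, d))          # the inputs X_1, ..., X_n
    Y = X @ theta + sigma * rng.normal(size=n)   # channel: Y_i = theta^T X_i + eps_i
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(scale, np.linalg.norm(theta_hat - theta))  # error shrinks as scale grows
```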

For the remainder of the class, we explore these ideas in substantially more detail.

1.3 A remark about measure theory


As this book focuses on a number of fundamental questions in statistics, machine learning, and
information theory, fully general statements of the results often require measure theory. Thus,
formulae such as ∫ f (x)dP (x) or ∫ f (x)dµ(x) appear. While knowledge of measure theory is cer-
tainly useful and may help appreciate the results, it is completely inessential to developing the
intuition and, I hope, understanding the proofs and main results. Indeed, the best strategy (for
a reader unfamiliar with measure theory) is to simply replace every instance of a formula such as
dµ(x) with dx. The most frequent cases we encounter will be the following: we wish to compute
the expectation of a function f of a random variable X following distribution P , that is, EP [f (X)].
Normally, we would write EP [f (X)] = ∫ f (x)dP (x), or sometimes EP [f (X)] = ∫ f (x)p(x)dµ(x),
saying that “P has density p with respect to the underlying measure µ.” Instead, one may simply
(and intuitively) assume that x really has density p over the reals, and instead of computing the
integral

EP [f (X)] = ∫ f (x)dP (x) or EP [f (X)] = ∫ f (x)p(x)dµ(x),

assume we may write

EP [f (X)] = ∫ f (x)p(x)dx.

Nothing will be lost.

1.4 Outline and chapter discussion


We divide the lecture notes into four distinct parts, each of course interacting with the others,
but it is possible to read each as a reasonably self-contained unit. The lecture notes begin with
a review (Chapter 2) that introduces the basic information-theoretic quantities that we discuss:
mutual information, entropy, and divergence measures. It is required reading for all the chapters
that follow.


Part I of the notes covers what I term “stability” based results. At a high level, this means that
we ask what can be gained by considering situations where individual observations in a sequence
of random variables X1 , . . . , Xn have little effect on various functions of the sequence. We begin
in Chapter 4 with basic concentration inequalities, discussing how sums and related quantities can
converge quickly; while this material is essential for the remainder of the lectures, it does not depend
on particular information-theoretic techniques. We discuss some heuristic applications to problems
in statistical learning—empirical risk minimization—in this section of the notes. We provide a
treatment of more advanced ideas in Chapter 6, including some approaches to concentration via
entropy methods. We then turn in Chapter 5 to carefully investigate generalization and convergence
guarantees—arguing that functions of a sample X1 , . . . , Xn are representative of the full population
P from which the sample is drawn—based on controlling different information-theoretic quantities.
In this context, we develop PAC-Bayesian bounds, and we also use the same framework to present
tools to control generalization and convergence in interactive data analyses. These types of analyses
reflect modern statistics, where one performs some type of data exploration before committing to a
fuller analysis, but which breaks classical statistical approaches, because the analysis now depends
on the sample. Finally, we provide a chapter (Chapter 7) on disclosure limitation and privacy
techniques, all of which repose on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower bounds
on the possible rates of convergence for various estimation, learning, and other statistical problems.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of sim-
ply giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (Chapter 2.4.1), but we also provide an interpreta-
tion of entropy and information as measures of uncertainty in statistical experiments and statistical
learning, which is a perspective typically missing from information-theoretic treatments of entropy
(Chapters TBD). We also relate these ideas to game-playing and maximum likelihood estimation.
Finally, we relate generic divergence measures to questions of optimality and consistency in statisti-
cal and machine learning problems, which allows us to delineate when (at least in asymptotic senses)
it is possible to computationally efficiently learn good predictors and design good experiments.

Chapter 2

An information theory review

In this first introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.

2.1 Basics of Information Theory


In this section, we review the basic definitions in information theory, including (Shannon) entropy,
KL-divergence, mutual information, and their conditional versions. Before beginning, I must make
an apology to any information theorist reading these notes: any time we use a log, it will always
be base-e. This is more convenient for our analyses, and it also (later) makes taking derivatives
much nicer.
In this first section, we will assume that all distributions are discrete; this makes the quantities
somewhat easier to manipulate and allows us to completely avoid any complicated measure-theoretic
quantities. In Section 2.2 of this note, we show how to extend the important definitions (for our
purposes)—those of KL-divergence and mutual information—to general distributions, where basic
ideas such as entropy no longer make sense. However, even in this general setting, we will see we
essentially lose no generality by assuming all variables are discrete.

2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.


Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P ) is defined as
H(X) := − Σ_{x} p(x) log p(x).

Because p(x) ≤ 1 for all x, it is clear that this quantity is nonnegative. We will show later that if X
is finite, the maximum entropy distribution on X is the uniform distribution, setting p(x) = 1/|X |
for all x, which has entropy log(|X |).
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [158]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to P , with a k-ary string (i.e. each entry of the string takes on one of k values), then the minimal
expected length of the encoding is given by H(X) = − Σ_{x} p(x) log_k p(x). Moreover, this is achiev-
able (to within a length of at most 1 symbol) by using Huffman codes (among many other types of
codes). As an example of this interpretation, we may consider encoding a random variable X with
equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we
simply assign an integer to each item and encode each integer with the natural (binary) integer
encoding of length ⌈log m⌉.
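As a quick numerical illustration of this interpretation (a sketch, not part of the original text), the following Python snippet computes the entropy of the equi-probable distribution on m items in bits and compares it with the ⌈log2 m⌉ bits of the naive fixed-length encoding:

```python
import math

def entropy(p, base=math.e):
    """H(X) = -sum_x p(x) log p(x), ignoring zero-probability outcomes."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

m = 6
uniform = [1.0 / m] * m
print(entropy(uniform, base=2))   # log2(6) = 2.585 bits
print(math.ceil(math.log2(m)))    # naive fixed-length code: 3 bits per item
```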
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
H(X | Y = y) = − Σ_{x} p(x | y) log p(x | y)   and   H(X | Y ) = Σ_{y} p(y)H(X | Y = y),

where p(x | y) is the p.m.f. of X given that Y = y.


Let us now provide a few examples of the entropy of various discrete random variables.
Example 2.1.1 (Uniform random variables): As we noted earlier, if a random variable X is
uniform on a set of size m, then H(X) = log m.

Example 2.1.2 (Bernoulli random variables): Let h2 (p) = −p log p − (1 − p) log(1 − p) denote
the binary entropy, which is the entropy of a Bernoulli(p) random variable.

Example 2.1.3 (Geometric random variables): A random variable X is Geometric(p), for


some p ∈ [0, 1], if it is supported on {1, 2, . . .}, and P (X = k) = (1 − p)^{k−1} p; this is the
probability distribution of the number X of Bernoulli(p) trials until a single success. The
entropy of such a random variable is

H(X) = − Σ_{k=1}^{∞} (1 − p)^{k−1} p [(k − 1) log(1 − p) + log p] = − Σ_{k=0}^{∞} (1 − p)^k p [k log(1 − p) + log p].

As Σ_{k=0}^{∞} α^k = 1/(1 − α) and (d/dα) 1/(1 − α) = 1/(1 − α)^2 = Σ_{k=1}^{∞} k α^{k−1}, we have

H(X) = −p log(1 − p) · Σ_{k=1}^{∞} k(1 − p)^k − p log p · Σ_{k=0}^{∞} (1 − p)^k = − ((1 − p)/p) log(1 − p) − log p.

As p ↓ 0, we see that H(X) ↑ ∞.
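A quick numerical check of this closed form (a sketch; the infinite series is truncated at a large cutoff) confirms that H(X) = −((1 − p)/p) log(1 − p) − log p:

```python
import math

def geometric_entropy_direct(p, kmax=10_000):
    # -sum_{k=1}^kmax (1-p)^{k-1} p log[(1-p)^{k-1} p], a truncation of the series
    return -sum((1 - p) ** (k - 1) * p * ((k - 1) * math.log(1 - p) + math.log(p))
                for k in range(1, kmax + 1))

def geometric_entropy_closed(p):
    return -((1 - p) / p) * math.log(1 - p) - math.log(p)

for p in [0.1, 0.5, 0.9]:
    print(p, geometric_entropy_direct(p), geometric_entropy_closed(p))
```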


Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by

p(k) = A / (k log^2 k),   where A^{−1} = Σ_{k=2}^{∞} 1/(k log^2 k) < ∞,

the last sum finite as ∫_2^∞ 1/(x log^α x) dx < ∞ if and only if α > 1: for α = 1, we have
∫_e^x 1/(t log t) dt = log log x, while for α > 1, we have

(d/dx) (log x)^{1−α} = (1 − α)/(x log^α x),

so that ∫_e^∞ 1/(t log^α t) dt = 1/(α − 1). To see that the entropy is infinite, note that

H(X) = A Σ_{k≥2} [log(1/A) + log k + 2 log log k]/(k log^2 k) ≥ A Σ_{k≥2} (log k)/(k log^2 k) − C = ∞,

where C is a numerical constant.

KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function f : Rk → R as any
bowl-shaped function, that is, one satisfying
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) (2.1.1)
for all λ ∈ [0, 1], all x, y. The function f is strictly convex if the convexity inequality (2.1.1) is
strict for λ ∈ (0, 1) and x ≠ y. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let f be convex. Then for any random variable X,
f (E[X]) ≤ E[f (X)].
Moreover, if f is strictly convex, then f (E[X]) < E[f (X)] unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is
Dkl (P ||Q) := Σ_{x∈X} p(x) log (p(x)/q(x)).

We observe immediately that Dkl (P ||Q) ≥ 0. To see this, we apply Jensen’s inequality (Propo-
sition 2.1.5) to the function − log and the random variable q(X)/p(X), where X is distributed
according to P :
   
Dkl (P ||Q) = −E[log (q(X)/p(X))] ≥ − log E[q(X)/p(X)] = − log Σ_{x} p(x) (q(x)/p(x)) = − log(1) = 0.


Moreover, as − log is strictly convex, we have Dkl (P ||Q) > 0 unless P = Q. Another consequence of
the positivity of the KL-divergence is that whenever the set X is finite with cardinality |X | < ∞,
for any random variable X supported on X we have H(X) ≤ log |X |. Indeed, letting m = |X |, Q
be the uniform distribution on X so that q(x) = 1/m, and X have distribution P on X , we have

0 ≤ Dkl (P ||Q) = Σ_{x} p(x) log (p(x)/q(x)) = −H(X) − Σ_{x} p(x) log q(x) = −H(X) + log m,   (2.1.2)

so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions
on the set X .
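The following Python sketch (with an arbitrary illustrative p.m.f.) computes Dkl (P ||Q) against the uniform distribution and verifies both the nonnegativity shown above and the identity (2.1.2):

```python
import math

def dkl(p, q):
    """Sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

p = [0.5, 0.25, 0.125, 0.125]      # an arbitrary p.m.f.
m = len(p)
uniform = [1.0 / m] * m
print(dkl(p, uniform) >= 0)                                       # nonnegativity
print(abs(dkl(p, uniform) - (math.log(m) - entropy(p))) < 1e-12)  # identity (2.1.2)
```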

Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and the product of their marginal distributions.
More mathematically,
I(X; Y ) := Σ_{x,y} p(x, y) log [p(x, y)/(p(x)p(y))].   (2.1.3)

We can rewrite this in several ways. First, using Bayes’ rule, we have p(x, y)/p(y) = p(x | y), so
I(X; Y ) = Σ_{x,y} p(y)p(x | y) log [p(x | y)/p(x)]
         = − Σ_{x} p(x) Σ_{y} p(y | x) log p(x) + Σ_{y} p(y) Σ_{x} p(x | y) log p(x | y)
         = H(X) − H(X | Y ).

Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
Comparing the definition (2.1.3) to that for KL-divergence, we see that if PXY is the joint
distribution of X and Y , while PX and PY are their marginal distributions (distributions when X
and Y are treated independently), then

I(X; Y ) = Dkl (PXY ||PX × PY ) ≥ 0.

Moreover, we have I(X; Y ) > 0 unless X and Y are independent.
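To see the identity I(X; Y ) = Dkl (PXY ||PX × PY ) numerically, here is a small Python sketch with hypothetical 2 × 2 joint p.m.f.s: an independent joint gives zero information, while the case X = Y gives log 2.

```python
import math

def mutual_information(joint):
    """I(X;Y) = Dkl(P_XY || P_X x P_Y) for a joint p.m.f. joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent: 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # X = Y: log 2
```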


As with entropy, we may also define the conditional information between X and Y given Z,
which is the mutual information between X and Y when Z is observed (on average). That is,
I(X; Y | Z) := Σ_{z} p(z) I(X; Y | Z = z) = H(X | Z) − H(X | Y, Z) = H(Y | Z) − H(Y | X, Z).

Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable X with
density p is defined by

h(X) := − ∫ p(x) log p(x) dx.   (2.1.4)


Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
h(X | Y ) = − ∫ p(y) ∫ p(x | y) log p(x | y) dx dy.

We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is
h(X | Y ) = E[ − ∫ p(x | Y ) log p(x | Y ) dx ],

where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes
Dkl (P ||Q) = ∫ p(x) log (p(x)/q(x)) dx,
and similarly, we have the analogues of mutual information as
I(X; Y ) = ∫ p(x, y) log [p(x, y)/(p(x)p(y))] dx dy = h(X) − h(X | Y ) = h(Y ) − h(Y | X).
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
a normal random variable is straightforward to compute. Indeed, for X ∼ N(µ, σ 2 ) we have
p(x) = (2πσ^2)^{−1/2} exp(−(x − µ)^2/(2σ^2)), so that

h(X) = − ∫ p(x) [ log (2πσ^2)^{−1/2} − (x − µ)^2/(2σ^2) ] dx = (1/2) log(2πσ^2) + E[(X − µ)^2]/(2σ^2) = (1/2) log(2πeσ^2).

For a general multivariate Gaussian, where X ∼ N(µ, Σ) for a vector µ ∈ Rn and Σ ≻ 0 with
density p(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−(1/2)(x − µ)^⊤ Σ^{−1} (x − µ)), we similarly have

h(X) = (1/2) E[ n log(2π) + log det(Σ) + (X − µ)^⊤ Σ^{−1} (X − µ) ]
     = (n/2) log(2π) + (1/2) log det(Σ) + (1/2) tr(Σ^{−1} Σ) = (n/2) log(2πe) + (1/2) log det(Σ),

using E[(X − µ)^⊤ Σ^{−1} (X − µ)] = tr(Σ^{−1} Σ) = n.
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let P be the multivariate
normal N(µ1 , Σ), and Q be the multivariate normal distribution with mean µ2 and identical
covariance Σ ≻ 0. Then we have that

Dkl (P ||Q) = (1/2)(µ1 − µ2 )^⊤ Σ^{−1} (µ1 − µ2 ).   (2.1.5)

We leave the computation of the identity (2.1.5) to the reader.
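One can check the identity (2.1.5) numerically in one dimension (a sketch; the means and σ below are arbitrary illustrative values) by comparing the closed form against a Riemann-sum approximation of ∫ p(x) log(p(x)/q(x)) dx:

```python
import numpy as np

mu1, mu2, sigma = 0.3, -1.2, 0.8
x = np.linspace(-15.0, 15.0, 400_001)

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

p, q = normal_pdf(x, mu1, sigma), normal_pdf(x, mu2, sigma)
numeric = float(np.sum(p * np.log(p / q)) * (x[1] - x[0]))  # Riemann sum of the integral
closed = (mu1 - mu2) ** 2 / (2 * sigma ** 2)                # formula (2.1.5) in one dimension
print(numeric, closed)   # the two values agree to high precision
```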


An interesting consequence of Example 2.1.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that X has mean 0. Let P be the distribution of
X with density p, and let Q be multivariate normal with mean 0 and covariance Σ; let Z be this
random variable. Then
Dkl (P ||Q) = ∫ p(x) log (p(x)/q(x)) dx = −h(X) + ∫ p(x) [ (n/2) log(2π) + (1/2) log det(Σ) + (1/2) x^⊤ Σ^{−1} x ] dx
            = −h(X) + h(Z),
because Z has the same covariance as X. As 0 ≤ Dkl (P ||Q), we have h(Z) ≥ h(X) as desired.

We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [134].

2.1.2 Chain rules and related properties


We now illustrate several of the properties of entropy, KL divergence, and mutual information;
these allow easier calculations and analysis.

Chain rules: We begin by describing relationships between collections of random variables


X1 , . . . , Xn and individual members of the collection. (Throughout, we use the notation X_i^j =
(Xi , Xi+1 , . . . , Xj ) to denote the sequence of random variables from indices i through j.)
For the entropy, we have the simplest chain rule:

H(X1 , . . . , Xn ) = H(X1 ) + H(X2 | X1 ) + · · · + H(Xn | X_1^{n−1}).
This follows from the standard decomposition of a probability distribution p(x, y) = p(x)p(y | x).
To see the chain rule, then, note that

H(X, Y ) = − Σ_{x,y} p(x)p(y | x) log [p(x)p(y | x)]
         = − Σ_{x} p(x) Σ_{y} p(y | x) log p(x) − Σ_{x} p(x) Σ_{y} p(y | x) log p(y | x) = H(X) + H(Y | X).

Now set X = X_1^{n−1}, Y = Xn , and simply induct.


A related corollary of the definitions of mutual information is the well-known result that con-
ditioning reduces entropy:
H(X | Y ) ≤ H(X) because I(X; Y ) = H(X) − H(X | Y ) ≥ 0.
So on average, knowing about a variable Y can only decrease your uncertainty about X. That
conditioning reduces entropy for continuous random variables is also immediate, as for X continuous
we have I(X; Y ) = h(X) − h(X | Y ) ≥ 0, so that h(X) ≥ h(X | Y ).
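Both the chain rule H(X, Y ) = H(X) + H(Y | X) and the fact that conditioning reduces entropy are easy to check numerically. The sketch below (using a randomly drawn joint p.m.f., an illustrative construction) computes H(Y | X) directly as Σ_x p(x)H(Y | X = x):

```python
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random((4, 5))
joint /= joint.sum()                      # a random joint p.m.f. p(x, y)

def H(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

px, py = joint.sum(axis=1), joint.sum(axis=0)
# H(Y | X) = sum_x p(x) H(Y | X = x), computed directly from the conditionals
H_y_given_x = sum(px[i] * H(joint[i] / px[i]) for i in range(joint.shape[0]))
print(np.isclose(H(joint.ravel()), H(px) + H_y_given_x))  # chain rule: True
print(H_y_given_x <= H(py))                               # H(Y|X) <= H(Y): True
```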


Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
I(X; Y_1^n) = Σ_{i=1}^{n} I(X; Yi | Y_1^{i−1}).

Indeed, we have
I(X; Y_1^n) = H(Y_1^n) − H(Y_1^n | X) = Σ_{i=1}^{n} [H(Yi | Y_1^{i−1}) − H(Yi | X, Y_1^{i−1})] = Σ_{i=1}^{n} I(X; Yi | Y_1^{i−1}).

The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that the distribution P = P1 ×P2 ×· · ·×Pn , and Q = Q1 ×· · ·×Qn ,
that is, that P and Q are product distributions over independent random variables Xi ∼ Pi or
Xi ∼ Qi . Then we immediately have the tensorization identity
Dkl (P ||Q) = Dkl (P1 × · · · × Pn ||Q1 × · · · × Qn ) = Σ_{i=1}^{n} Dkl (Pi ||Qi ).

We remark in passing that these two identities hold for arbitrary distributions Pi and Qi or random
variables X, Y . As a final tensorization identity, we consider a more general chain rule for KL-
divergences, which will frequently be useful. We abuse notation temporarily, and for random
variables X and Y with distributions P and Q, respectively, we denote

Dkl (X||Y ) := Dkl (P ||Q) .

In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions PX|z and PY |z conditioned on Z = z, respectively. Then we define

Dkl (X||Y | Z) = EZ [Dkl (PX|Z ||PY |Z )],

so that if Z is discrete we have Dkl (X||Y | Z) = Σ_{z} p(z)Dkl (PX|z ||PY |z ). With this notation, we
have the chain rule
Dkl (X1 , . . . , Xn ||Y1 , . . . , Yn ) = Σ_{i=1}^{n} Dkl (Xi ||Yi | X_1^{i−1}),   (2.1.6)

because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions PXY and QXY we have
 
Dkl (PXY ||QXY ) = Σ_{x,y} p(x, y) log [p(x, y)/q(x, y)] = Σ_{x,y} p(x)p(y | x) [ log (p(y | x)/q(y | x)) + log (p(x)/q(x)) ]
                = Σ_{x} p(x) log (p(x)/q(x)) + Σ_{x} p(x) Σ_{y} p(y | x) log (p(y | x)/q(y | x)),

where the final equality uses that Σ_{y} p(y | x) = 1 for all x.
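The chain rule (2.1.6) is likewise easy to sanity-check in the discrete case. The sketch below (with hypothetical random joint p.m.f.s P and Q) compares Dkl (PXY ||QXY ) with the marginal divergence plus the conditional divergence:

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((3, 4)); P /= P.sum()   # joint p.m.f. p(x, y)
Q = rng.random((3, 4)); Q /= Q.sum()   # joint p.m.f. q(x, y)

def dkl(p, q):
    return float((p * np.log(p / q)).sum())

px, qx = P.sum(axis=1), Q.sum(axis=1)
# conditional divergence sum_x p(x) Dkl(p(. | x) || q(. | x))
cond = sum(px[i] * dkl(P[i] / px[i], Q[i] / qx[i]) for i in range(P.shape[0]))
print(np.isclose(dkl(P, Q), dkl(px, qx) + cond))   # chain rule (2.1.6): True
```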
Expanding upon this, we give several tensorization identities, showing how to transform ques-
tions about the joint distribution of many random variables to simpler questions about their
marginals. As a first example, as a consequence of the fact that conditioning decreases entropy,
for any sequence of (discrete or continuous, as appropriate) random variables, we have

H(X1 , . . . , Xn ) ≤ H(X1 ) + · · · + H(Xn ) and h(X1 , . . . , Xn ) ≤ h(X1 ) + . . . + h(Xn ).

Both inequalities hold with equality if and only if X1 , . . . , Xn are mutually independent. (The only
if follows because I(X; Y ) > 0 whenever X and Y are not independent, by Jensen’s inequality and
the fact that Dkl (P ||Q) > 0 unless P = Q.)
We return to information and divergence now. Suppose that random variables Yi are indepen-
dent conditional on X, meaning that

P (Y1 = y1 , . . . , Yn = yn | X = x) = P (Y1 = y1 | X = x) · · · P (Yn = yn | X = x).

Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality
I(X; Y1 , . . . , Yn ) = Σ_{i=1}^{n} [H(Yi | Y_1^{i−1}) − H(Yi | X, Y_1^{i−1})]
                    = Σ_{i=1}^{n} [H(Yi | Y_1^{i−1}) − H(Yi | X)] ≤ Σ_{i=1}^{n} [H(Yi ) − H(Yi | X)] = Σ_{i=1}^{n} I(X; Yi ),   (2.1.7)

where the inequality follows because conditioning reduces entropy.

2.1.3 Data processing inequalities


A standard problem in information theory (and statistical inference) is to understand the degrada-
tion of a signal after it is passed through some noisy channel (or observation process). The simplest
of such results, which we will use frequently, is that we can only lose information by adding noise.
In particular, assume we have the Markov chain

X → Y → Z.

Then we obtain the classical data processing inequality.

Proposition 2.1.9. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).

Proof  We expand the mutual information I(X; Y, Z) in two ways:
\[
I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + \underbrace{I(X; Z \mid Y)}_{=0},
\]

where we note that the final equality follows because X is independent of Z given Y :

I(X; Z | Y ) = H(X | Y ) − H(X | Y, Z) = H(X | Y ) − H(X | Y ) = 0.

Since I(X; Y | Z) ≥ 0, this gives the result.


There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we let $P_1$ and $P_2$ be distributions on $\mathcal{X}$ and $Q_1$ and $Q_2$ be the induced distributions on $\mathcal{Z}$, that is, $Q_i(A) = \int \mathbb{P}(Z \in A \mid x)\, dP_i(x)$, then we have
\[
D_{\mathrm{kl}}(Q_1\|Q_2) \le D_{\mathrm{kl}}(P_1\|P_2),
\]
the basic KL-divergence data processing inequality. A consequence of this is that, for any function f and random variables X and Y on the same space, we have
\[
D_{\mathrm{kl}}(f(X)\|f(Y)) \le D_{\mathrm{kl}}(X\|Y).
\]
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
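As a small illustration (a sketch under assumptions: the channel matrix and input distributions below are randomly generated, not from the text), one can check the KL data processing inequality numerically by pushing two distributions through a common column-stochastic kernel:

```python
# KL data processing: pushing two distributions through the same Markov
# kernel K can only decrease their divergence. K[z, x] = P(Z = z | X = x).
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(5))
p2 = rng.dirichlet(np.ones(5))
K = rng.dirichlet(np.ones(3), size=5).T   # shape (3, 5); columns sum to 1
assert np.allclose(K.sum(axis=0), 1.0)
q1, q2 = K @ p1, K @ p2                   # induced marginals on Z
assert kl(q1, q2) <= kl(p1, p2) + 1e-12
```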

2.2 General divergence measures and definitions


Having given our basic definitions of mutual information and divergence, we now show how the
definitions of KL-divergence and mutual information extend to arbitrary distributions P and Q
and arbitrary sets X . This requires a bit of setup, including defining set algebras (which, we will
see, simply correspond to quantization of the set X ), but allows us to define divergences in full
generality.

2.2.1 Partitions, algebras, and quantizers


Let X be an arbitrary space. A quantizer on X is any function that maps X to a finite collection
of integers. That is, fixing m < ∞, a quantizer is any function q : X → {1, . . . , m}. In particular,
a quantizer q partitions the space X into the subsets of x ∈ X for which q(x) = i. A related
notion—we will see the precise relationship presently—is that of an algebra of sets on X . We say
that a collection of sets A is an algebra on X if the following are true:
1. The set X ∈ A.
2. The collection of sets A is closed under finite set operations: union, intersection, and com-
plementation. That is, A, B ∈ A implies that Ac ∈ A, A ∩ B ∈ A, and A ∪ B ∈ A.
There is a 1-to-1 correspondence between quantizers—and their associated partitions of the set
X —and finite algebras on a set X , which we discuss briefly.1 It should be clear that there is a
one-to-one correspondence between finite partitions of the set X and quantizers q, so we must argue
that finite partitions of X are in one-to-one correspondence with finite algebras defined over X .
In one direction, we may consider a quantizer q : X → {1, . . . , m}. Let the sets A1 , . . . , Am
be the partition associated with q, that is, for x ∈ Ai we have q(x) = i, or Ai = q−1 ({i}). Then
we may define an algebra Aq as the collection of all finite set operations performed on A1 , . . . , Am
(note that this is a finite collection, as finite set operations performed on the partition A1 , . . . , Am
induce only a finite collection of sets).
For the other direction, consider a finite algebra A over the set X . We can then construct a
quantizer qA that corresponds to this algebra. To do so, we define an atom of A as any non-empty
set A ∈ A such that if B ⊂ A and B ∈ A, then B = A or B = ∅. That is, the atoms of A are the
“smallest” sets in A. We claim there is a unique partition of X with atomic sets from A; we prove
this inductively.
1. Pedantically, this one-to-one correspondence holds up to permutations of the partition induced by the quantizer.


Base case: There is at least 1 atomic set, as A is finite; call it A1 .

Induction step: Assume we have atomic sets A1 , . . . , Ak ∈ A. Let B = (A1 ∪ · · · ∪ Ak )c be their


complement, which we assume is non-empty (otherwise we have a partition of X into atomic sets).
The complement B is either atomic, in which case the sets {A1 , A2 , . . . , Ak , B} are a partition of
X consisting of atoms of A, or B is not atomic. If B is not atomic, consider all the sets of the form
A ∩ B for A ∈ A. Each of these belongs to A, and at least one of them is atomic, as there is a
finite number of them. This means there is a non-empty set Ak+1 ⊂ B such that Ak+1 is atomic.
By repeating this induction, which must stop at some finite index m as A is finite, we construct
a collection A1, . . . , Am of disjoint atomic sets in A for which ∪i Ai = X. (The uniqueness is
an exercise for the reader.) Thus we may define the quantizer qA via

qA (x) = i when x ∈ Ai .

2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
Q conditioned on $\mathcal{A}$ is
\[
D_{\mathrm{kl}}(P\|Q \mid \mathcal{A}) := \sum_{A \in \mathrm{atoms}(\mathcal{A})} P(A)\log\frac{P(A)}{Q(A)}.
\]

That is, we simply sum over the partition of X . Another way to write this is as follows. Let
q : X → {1, . . . , m} be a quantizer, and define the sets Ai = q−1 ({i}) to be the pre-images of each
i (i.e. the different quantization regions, or the partition of X that q induces). Then the quantized
KL-divergence between P and Q is
\[
D_{\mathrm{kl}}(P\|Q \mid q) := \sum_{i=1}^m P(A_i)\log\frac{P(A_i)}{Q(A_i)}.
\]

We may now give the fully general definition of KL-divergence: the KL-divergence between P
and Q is defined as
\[
\begin{aligned}
D_{\mathrm{kl}}(P\|Q) &:= \sup\{D_{\mathrm{kl}}(P\|Q \mid \mathcal{A}) \text{ such that } \mathcal{A} \text{ is a finite algebra on } \mathcal{X}\} \\
&\;= \sup\{D_{\mathrm{kl}}(P\|Q \mid q) \text{ such that } q \text{ quantizes } \mathcal{X}\}.
\end{aligned} \tag{2.2.1}
\]

This also gives a rigorous definition of mutual information. Indeed, if X and Y are random variables
with joint distribution PXY and marginal distributions PX and PY , we simply define

\[
I(X; Y) = D_{\mathrm{kl}}(P_{XY} \,\|\, P_X \times P_Y).
\]
When P and Q have densities p and q, the definition (2.2.1) reduces to
\[
D_{\mathrm{kl}}(P\|Q) = \int_{\mathbb{R}} p(x)\log\frac{p(x)}{q(x)}\, dx,
\]


while if P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the definition (2.2.1) is equivalent to
\[
D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)},
\]

precisely as in the discrete case.
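A short numeric sketch makes the supremum definition (2.2.1) tangible. Below (all parameters are arbitrary choices for illustration) we quantize the Gaussians N(0, 1) and N(1, 1) with ever-finer partitions of an interval of the real line; the quantized divergences approach the closed-form value $(\mu_1 - \mu_2)^2/(2\sigma^2) = 1/2$:

```python
# Quantized KL divergences between N(0,1) and N(1,1) under finer and finer
# partitions, approaching the closed form (mu1 - mu2)^2 / (2 sigma^2).
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.0, 1.0

def quantized_kl(num_bins):
    # num_bins intervals on [-8, 9], plus two tail cells.
    edges = np.linspace(-8.0, 9.0, num_bins + 1)
    cdf1 = np.concatenate(([0.0], norm.cdf(edges, mu1, sigma), [1.0]))
    cdf2 = np.concatenate(([0.0], norm.cdf(edges, mu2, sigma), [1.0]))
    p, q = np.diff(cdf1), np.diff(cdf2)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for m in [2, 8, 32, 128]:
    print(f"{m:4d} cells: {quantized_kl(m):.4f}")
print(f"exact:     {(mu1 - mu2) ** 2 / (2 * sigma ** 2):.4f}")
```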


We remark in passing that if the set X is a product space, meaning that X = X1 × X2 × · · · × Xn
for some n < ∞ (this is the case for mutual information, for example), then we may assume our
quantizer always quantizes sets of the form A = A1 × A2 × · · · × An , that is, Cartesian products.
Written differently, when we consider algebras on X , the atoms of the algebra may be assumed to be
Cartesian products of sets, and our partitions of X can always be taken as Cartesian products. (See
Gray [94, Chapter 5].) Written slightly differently, if P and Q are distributions on X = X1 ×· · ·×Xn
and $q_i$ is a quantizer for the set $\mathcal{X}_i$ (inducing the partition $A^i_1, \ldots, A^i_{m_i}$ of $\mathcal{X}_i$), we may define
\[
D_{\mathrm{kl}}(P\|Q \mid q_1, \ldots, q_n) = \sum_{j_1, \ldots, j_n} P(A^1_{j_1} \times A^2_{j_2} \times \cdots \times A^n_{j_n}) \log\frac{P(A^1_{j_1} \times A^2_{j_2} \times \cdots \times A^n_{j_n})}{Q(A^1_{j_1} \times A^2_{j_2} \times \cdots \times A^n_{j_n})}.
\]

Then the general definition (2.2.1) of KL-divergence specializes to
\[
D_{\mathrm{kl}}(P\|Q) = \sup\big\{D_{\mathrm{kl}}(P\|Q \mid q_1, \ldots, q_n) \text{ such that each } q_i \text{ quantizes } \mathcal{X}_i\big\}.
\]
So we need only consider "rectangular" sets in the definition of KL-divergence.

Measure-theoretic definition of KL-divergence  If you have never seen measure theory before, skim this section; while the notation may be somewhat intimidating, it is fine to always consider only continuous or fully discrete distributions. We describe an interpretation that, for our purposes, means one never really needs to think about measure-theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
\[
D_{\mathrm{kl}}(P\|Q) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\, d\mu(x). \tag{2.2.2}
\]

The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [94, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be as follows. When integrating a function f(x), replace $\int f(x)\, d\mu(x)$ with one of two expressions: one may simply think of $d\mu(x)$ as $dx$, so that we are performing standard integration $\int f(x)\, dx$, or one may think of the integral operation $\int f(x)\, d\mu(x)$ as summing the argument of the integral, so that $d\mu(x) = 1$ for each point x and $\int f(x)\, d\mu(x) = \sum_x f(x)$. (This corresponds to µ being "counting measure" on $\mathcal{X}$.)

2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [4, 54]
(see also the alternate interpretations in the article by Liese and Vajda [131]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a


convex function satisfying f(1) = 0. If $\mathcal{X}$ is a discrete set, then the f-divergence between P and Q is
\[
D_f(P\|Q) := \sum_x q(x) f\Big(\frac{p(x)}{q(x)}\Big).
\]
More generally, for any set $\mathcal{X}$ and a quantizer $q : \mathcal{X} \to \{1, \ldots, m\}$, letting $A_i = q^{-1}(\{i\}) = \{x \in \mathcal{X} \mid q(x) = i\}$ be the partition the quantizer induces, we can define the quantized divergence
\[
D_f(P\|Q \mid q) = \sum_{i=1}^m Q(A_i) f\Big(\frac{P(A_i)}{Q(A_i)}\Big),
\]

and the general definition of an f divergence is (in analogy with the definition (2.2.1) of general
KL divergences)

\[
D_f(P\|Q) := \sup\{D_f(P\|Q \mid q) \text{ such that } q \text{ quantizes } \mathcal{X}\}. \tag{2.2.3}
\]

The definition (2.2.3) shows that any time we have computations involving f-divergences—such as KL-divergence or mutual information—it is no loss of generality, when performing the computations, to assume that all distributions have finite discrete support. There is a measure-theoretic version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are absolutely continuous with respect to the base measure µ. The f-divergence between P and Q is then
\[
D_f(P\|Q) := \int_{\mathcal{X}} q(x) f\Big(\frac{p(x)}{q(x)}\Big)\, d\mu(x). \tag{2.2.4}
\]
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of a convex function f is $\mathrm{cl}\, f(x) = \sup\{\ell(x) \mid \ell \le f, \ell \text{ linear}\}$, the supremum over all linear functions that globally lower bound f. Then [104, Proposition IV.2.2.2] the closure of pers(f) is defined, for any $t_0 \in \mathrm{int\,dom}\, f$, by
\[
\mathrm{cl\,pers}(f)(t, u) = \begin{cases} u f(t/u) & \text{if } u > 0 \\ \lim_{\alpha \downarrow 0} \alpha f(t_0 - t + t/\alpha) & \text{if } u = 0 \\ +\infty & \text{if } u < 0. \end{cases}
\]

(The choice of $t_0$ does not affect the definition.) Then the fully general formula expressing the f-divergence is
\[
D_f(P\|Q) = \int_{\mathcal{X}} \mathrm{cl\,pers}(f)(p(x), q(x))\, d\mu(x). \tag{2.2.5}
\]
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f-divergences from additional perspectives later—they are important both for optimality in estimation and for their relation to consistency and prediction problems, as we discuss in Chapter 14.


Examples We give several examples of f -divergences here; in Section 8.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.

Example 2.2.1 (KL-divergence): By taking f (t) = t log t, which is convex and satisfies
f (1) = 0, we obtain Df (P ||Q) = Dkl (P ||Q). 3

Example 2.2.2 (KL-divergence, reversed): By taking f (t) = − log t, we obtain Df (P ||Q) =


Dkl (Q||P ). 3

Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions P and Q defined on a set X is the maximum difference between probabilities they
assign on subsets of X :

\[
\|P - Q\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P(A) - Q(A)| = \sup_{A \subset \mathcal{X}} (P(A) - Q(A)), \tag{2.2.6}
\]

where the second equality follows by considering complements, since $P(A^c) = 1 - P(A)$. The total variation distance, as we shall see later, is important for verifying the optimality of different tests, and appears in the measurement of the difficulty of solving hypothesis testing problems. With the choice $f(t) = \frac{1}{2}|t - 1|$, we obtain the total variation distance, that is, $\|P - Q\|_{\mathrm{TV}} = D_f(P\|Q)$. There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will be useful in the sequel when we develop inequalities relating the divergences. 3

Lemma 2.2.4. Let P, Q be probability measures with densities p, q with respect to a base measure µ and $f(t) = \frac{1}{2}|t - 1|$. Then
\[
\begin{aligned}
\|P - Q\|_{\mathrm{TV}} = D_f(P\|Q) &= \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \\
&= \int [p(x) - q(x)]_+\, d\mu(x) = \int [q(x) - p(x)]_+\, d\mu(x) \\
&= P(dP/dQ > 1) - Q(dP/dQ > 1) = Q(dQ/dP > 1) - P(dQ/dP > 1).
\end{aligned}
\]
In particular, the set $A = \{x \mid p(x)/q(x) \ge 1\}$ maximizes $P(B) - Q(B)$ over $B \subset \mathcal{X}$ and so achieves $\|P - Q\|_{\mathrm{TV}} = P(A) - Q(A)$.
Proof  Eliding the measure-theoretic details,² we immediately have
\[
\begin{aligned}
D_f(P\|Q) = \frac{1}{2}\int \Big|\frac{p(x)}{q(x)} - 1\Big|\, q(x)\, d\mu(x) &= \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \\
&= \frac{1}{2}\int_{x : p(x) > q(x)} [p(x) - q(x)]\, d\mu(x) + \frac{1}{2}\int_{x : q(x) > p(x)} [q(x) - p(x)]\, d\mu(x) \\
&= \frac{1}{2}\int [p(x) - q(x)]_+\, d\mu(x) + \frac{1}{2}\int [q(x) - p(x)]_+\, d\mu(x).
\end{aligned}
\]
Considering the last integral $\int [q(x) - p(x)]_+\, d\mu(x)$, we see that the set $A = \{x : q(x) > p(x)\}$ satisfies
\[
Q(A) - P(A) = \int_A (q(x) - p(x))\, d\mu(x) \ge \int_B (q(x) - p(x))\, d\mu(x) = Q(B) - P(B)
\]
for any set B, as any $x \in B \setminus A$ clearly satisfies $q(x) - p(x) \le 0$.

2. To make this fully rigorous, we would use the Hahn decomposition of the signed measure P − Q to recognize that $\int f\,(dP - dQ) = \int f\,[dP - dQ]_+ - \int f\,[dQ - dP]_+$ for any integrable f.

Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions P and Q defined on a set $\mathcal{X}$ is generated by the function $f(t) = (\sqrt{t} - 1)^2 = t - 2\sqrt{t} + 1$. The Hellinger distance is then
\[
d_{\mathrm{hel}}(P, Q)^2 := \frac{1}{2}\int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\, d\mu(x). \tag{2.2.7}
\]
The non-squared version $d_{\mathrm{hel}}(P, Q)$ is indeed a distance between probability measures P and Q. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between P and Q, as
\[
d_{\mathrm{hel}}(P, Q)^2 = \frac{1}{2}\int \big(p(x) + q(x) - 2\sqrt{p(x)q(x)}\big)\, d\mu(x) = 1 - \int \sqrt{p(x)q(x)}\, d\mu(x), \tag{2.2.8}
\]
which makes clear that $d_{\mathrm{hel}}(P, Q) \in [0, 1]$ is on roughly the same scale as the variation distance; we will say more later. 3
Example 2.2.6 (χ² divergence): The χ²-divergence is generated by taking $f(t) = (t - 1)^2$, so that
\[
D_{\chi^2}(P\|Q) := \int \Big(\frac{p(x)}{q(x)} - 1\Big)^2 q(x)\, d\mu(x) = \int \frac{p(x)^2}{q(x)}\, d\mu(x) - 1, \tag{2.2.9}
\]
where the equality is immediate because $\int p\, d\mu = \int q\, d\mu = 1$. 3
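Because each of Examples 2.2.1–2.2.6 is simply an f-divergence with a different generator, a single routine computes them all in the discrete case. The following sketch is illustrative (the p.m.f.s are arbitrary; the Hellinger generator carries the factor 1/2 so that it matches definition (2.2.7)):

```python
# Discrete f-divergence Df(P||Q) = sum_x q(x) f(p(x)/q(x)), with the
# generators from Examples 2.2.1-2.2.6. Assumes p, q > 0 coordinatewise.
import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

generators = {
    "kl":          lambda t: t * np.log(t),                # Example 2.2.1
    "reversed_kl": lambda t: -np.log(t),                   # Example 2.2.2
    "tv":          lambda t: 0.5 * np.abs(t - 1),          # Example 2.2.3
    "hellinger^2": lambda t: 0.5 * (np.sqrt(t) - 1) ** 2,  # matches (2.2.7)
    "chi^2":       lambda t: (t - 1) ** 2,                 # Example 2.2.6
}

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
for name, f in generators.items():
    print(f"{name:12s} {f_divergence(p, q, f):.4f}")
```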

2.2.4 Inequalities and relationships between divergences


Important to our development will be different families of inequalities relating the different divergence measures. These inequalities will be particularly important because, in some cases, different distributions admit easy calculations with some divergences, such as KL or χ² divergence, but it can be challenging to work with others that may be more "natural" for a particular problem. Most importantly, replacing a variation distance by bounding it with an alternative divergence is often convenient for analyzing the properties of product distributions (as will become apparent in Chapter 8). We record several of these results here, making a passing connection to mutual information as well.
The first inequality shows that the Hellinger distance and variation distance roughly generate the same topology on collections of distributions, as each upper and lower bounds the other (if we tolerate polynomial losses).
Proposition 2.2.7. The total variation distance and Hellinger distance satisfy
\[
d_{\mathrm{hel}}^2(P, Q) \le \|P - Q\|_{\mathrm{TV}} \le d_{\mathrm{hel}}(P, Q)\sqrt{2 - d_{\mathrm{hel}}^2(P, Q)}.
\]
Proof  We begin with the upper bound. We have by Hölder's inequality that
\[
\begin{aligned}
\frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) &= \frac{1}{2}\int \big|\sqrt{p(x)} - \sqrt{q(x)}\big| \cdot \big|\sqrt{p(x)} + \sqrt{q(x)}\big|\, d\mu(x) \\
&\le \Big(\frac{1}{2}\int \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\, d\mu(x)\Big)^{\frac{1}{2}} \Big(\frac{1}{2}\int \big(\sqrt{p(x)} + \sqrt{q(x)}\big)^2\, d\mu(x)\Big)^{\frac{1}{2}} \\
&= d_{\mathrm{hel}}(P, Q)\Big(1 + \int \sqrt{p(x)q(x)}\, d\mu(x)\Big)^{\frac{1}{2}}.
\end{aligned}
\]


As in Example 2.2.5, we have $\int \sqrt{p(x)q(x)}\, d\mu(x) = 1 - d_{\mathrm{hel}}^2(P, Q)$, so this (along with the representation in Lemma 2.2.4 for the variation distance) implies
\[
\|P - Q\|_{\mathrm{TV}} = \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \le d_{\mathrm{hel}}(P, Q)\big(2 - d_{\mathrm{hel}}^2(P, Q)\big)^{\frac{1}{2}}.
\]
For the lower bound on total variation, note that for any $a, b \in \mathbb{R}_+$, we have $a + b - 2\sqrt{ab} \le |a - b|$ (check the cases a > b and a < b separately); thus
\[
d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2}\int \big[p(x) + q(x) - 2\sqrt{p(x)q(x)}\big]\, d\mu(x) \le \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x),
\]
as desired.
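Both directions of Proposition 2.2.7 are easy to probe empirically; this small sketch (random Dirichlet draws, purely for illustration) checks them on many random pairs of p.m.f.s:

```python
# Numeric check of Proposition 2.2.7 on random discrete distributions:
# dhel^2 <= TV <= dhel * sqrt(2 - dhel^2).
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    tv = 0.5 * np.abs(p - q).sum()
    hel2 = 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()
    assert hel2 <= tv + 1e-12
    assert tv <= np.sqrt(hel2) * np.sqrt(2 - hel2) + 1e-12
```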

Several important inequalities relate the variation distance to the KL-divergence. We state two such inequalities in the next proposition, both important enough to justify their own names.

Proposition 2.2.8. The total variation distance satisfies the following relationships.

(a) Pinsker's inequality: for any distributions P and Q,
\[
\|P - Q\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P\|Q). \tag{2.2.10}
\]

(b) The Bretagnolle–Huber inequality: for any distributions P and Q,
\[
\|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - \exp(-D_{\mathrm{kl}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\mathrm{kl}}(P\|Q)).
\]

Proof  Exercise 2.19 outlines one proof of Pinsker's inequality using the data processing inequality (Proposition 2.2.13). We present an alternative via the Cauchy–Schwarz inequality. Using the definition (2.2.1) of the KL-divergence, we may assume without loss of generality that P and Q are finitely supported, say with p.m.f.s $p_1, \ldots, p_m$ and $q_1, \ldots, q_m$. Define the negative entropy function $h(p) = \sum_{i=1}^m p_i \log p_i$. Then showing that $D_{\mathrm{kl}}(P\|Q) \ge 2\|P - Q\|_{\mathrm{TV}}^2 = \frac{1}{2}\|p - q\|_1^2$ is equivalent to showing that
\[
h(p) \ge h(q) + \langle \nabla h(q), p - q\rangle + \frac{1}{2}\|p - q\|_1^2, \tag{2.2.11}
\]
because by inspection $h(p) - h(q) - \langle \nabla h(q), p - q\rangle = \sum_i p_i \log\frac{p_i}{q_i}$. We do this via a Taylor expansion: we have
\[
\nabla h(p) = [\log p_i + 1]_{i=1}^m \quad \text{and} \quad \nabla^2 h(p) = \mathrm{diag}\big([1/p_i]_{i=1}^m\big).
\]
By Taylor's theorem, there is some $\tilde{p} = (1 - t)p + tq$, where $t \in [0, 1]$, such that
\[
h(p) = h(q) + \langle \nabla h(q), p - q\rangle + \frac{1}{2}\langle p - q, \nabla^2 h(\tilde{p})(p - q)\rangle.
\]
But looking at the final quadratic, we have for any vector v and any $\tilde{p} \ge 0$ satisfying $\sum_i \tilde{p}_i = 1$,
\[
\langle v, \nabla^2 h(\tilde{p}) v\rangle = \sum_{i=1}^m \frac{v_i^2}{\tilde{p}_i} = \|\tilde{p}\|_1 \sum_{i=1}^m \frac{v_i^2}{\tilde{p}_i} \ge \Big(\sum_{i=1}^m \sqrt{\tilde{p}_i}\,\frac{|v_i|}{\sqrt{\tilde{p}_i}}\Big)^2 = \|v\|_1^2,
\]
where the inequality follows from Cauchy–Schwarz applied to the vectors $[\sqrt{\tilde{p}_i}]_i$ and $[|v_i|/\sqrt{\tilde{p}_i}]_i$. Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let $a = \int \sqrt{p(x)q(x)}\, d\mu(x)$ be a shorthand for the affinity, so that $d_{\mathrm{hel}}^2(P, Q) = 1 - a$. Then Proposition 2.2.7 gives $\|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - a}\sqrt{1 + a} = \sqrt{1 - a^2}$. Now apply Jensen's inequality to the exponential: we have
\[
\int \sqrt{p(x)q(x)}\, d\mu(x) = \int \sqrt{\frac{q(x)}{p(x)}}\, p(x)\, d\mu(x) = \int \exp\Big(\frac{1}{2}\log\frac{q(x)}{p(x)}\Big)\, p(x)\, d\mu(x) \ge \exp\Big(\frac{1}{2}\int p(x)\log\frac{q(x)}{p(x)}\, d\mu(x)\Big) = \exp\Big(-\frac{1}{2} D_{\mathrm{kl}}(P\|Q)\Big).
\]
In particular, $\sqrt{1 - a^2} \le \sqrt{1 - \exp(-\frac{1}{2}D_{\mathrm{kl}}(P\|Q))^2} = \sqrt{1 - \exp(-D_{\mathrm{kl}}(P\|Q))}$, which is the first claim of part (b). For the second, note that $\sqrt{1 - c} \le 1 - \frac{1}{2}c$ for $c \in [0, 1]$ by concavity of the square root.
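The same style of random check applies here; the sketch below (illustrative only) verifies Pinsker's inequality and the first Bretagnolle–Huber bound on random discrete pairs:

```python
# Numeric sanity check of Pinsker (2.2.10) and Bretagnolle-Huber on
# randomly drawn discrete distributions.
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(2)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    d = kl(p, q)
    assert tv ** 2 <= d / 2 + 1e-12                  # Pinsker
    assert tv <= np.sqrt(1 - np.exp(-d)) + 1e-12     # Bretagnolle-Huber
```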

We also have the following bounds on the KL-divergence in terms of the χ²-divergence.

Proposition 2.2.9. For any distributions P, Q,
\[
D_{\mathrm{kl}}(P\|Q) \le \log(1 + D_{\chi^2}(P\|Q)) \le D_{\chi^2}(P\|Q).
\]
Proof  By Jensen's inequality, we have
\[
D_{\mathrm{kl}}(P\|Q) = \int \log\Big(\frac{dP}{dQ}\Big)\, dP \le \log \int \frac{(dP)^2}{dQ} = \log(1 + D_{\chi^2}(P\|Q)).
\]
The second inequality is immediate as $\log(1 + t) \le t$ for all $t > -1$.

It is also possible to relate mutual information between distributions to f-divergences, and even to bound the mutual information above and below by the Hellinger distance for certain problems. In this case, we consider the following situation: let $V \in \{0, 1\}$ be uniform at random, and conditional on V = v, draw $X \sim P_v$ for some distribution $P_v$ on a space $\mathcal{X}$. Then we have that
\[
I(X; V) = \frac{1}{2} D_{\mathrm{kl}}\big(P_0 \,\|\, \bar{P}\big) + \frac{1}{2} D_{\mathrm{kl}}\big(P_1 \,\|\, \bar{P}\big),
\]
where $\bar{P} = \frac{1}{2}P_0 + \frac{1}{2}P_1$. The divergence measure on the right side of the preceding identity is a special case of the Jensen–Shannon divergence, defined for $\lambda \in [0, 1]$ by
\[
D_{\mathrm{js},\lambda}(P\|Q) := \lambda D_{\mathrm{kl}}(P \,\|\, \lambda P + (1-\lambda)Q) + (1 - \lambda) D_{\mathrm{kl}}(Q \,\|\, \lambda P + (1-\lambda)Q), \tag{2.2.12}
\]
which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand $D_{\mathrm{js}}(P\|Q) := D_{\mathrm{js},\frac{1}{2}}(P\|Q)$ for the symmetric case). As a consequence, we also have
\[
I(X; V) = \frac{1}{2} D_f(P_0\|P_1) + \frac{1}{2} D_f(P_1\|P_0),
\]
where $f(t) = -t\log\big(\frac{1}{2t} + \frac{1}{2}\big) = t\log\frac{2t}{t+1}$, so that the mutual information is a particular f-divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.


Proposition 2.2.10. Let (X, V) be distributed as above. Then
\[
\log 2 \cdot d_{\mathrm{hel}}^2(P_0, P_1) \le I(X; V) = D_{\mathrm{js}}(P_0\|P_1) \le \min\big\{\log 2 \cdot \|P_0 - P_1\|_{\mathrm{TV}},\; 2\, d_{\mathrm{hel}}^2(P_0, P_1)\big\}.
\]
Proof  The lower bound and upper bound involving the variation distance both follow from analytic bounds on the binary entropy functional $h_2(p) = -p\log p - (1-p)\log(1-p)$. By expanding the mutual information and letting $p_0$ and $p_1$ be densities of $P_0$ and $P_1$ with respect to some base measure µ, we have
\[
\begin{aligned}
2 I(X; V) = 2 D_{\mathrm{js}}(P_0\|P_1) &= \int p_0 \log\frac{2 p_0}{p_0 + p_1}\, d\mu + \int p_1 \log\frac{2 p_1}{p_0 + p_1}\, d\mu \\
&= 2\log 2 + \int (p_0 + p_1)\Big[\frac{p_0}{p_0 + p_1}\log\frac{p_0}{p_0 + p_1} + \frac{p_1}{p_0 + p_1}\log\frac{p_1}{p_0 + p_1}\Big]\, d\mu \\
&= 2\log 2 - \int (p_0 + p_1)\, h_2\Big(\frac{p_0}{p_0 + p_1}\Big)\, d\mu.
\end{aligned}
\]
We claim that
\[
2\log 2 \cdot \min\{p, 1 - p\} \le h_2(p) \le 2\log 2 \cdot \sqrt{p(1 - p)}
\]
for all $p \in [0, 1]$ (see Exercises 2.17 and 2.18). Then the upper and lower bounds on the information become nearly immediate.
For the variation-based upper bound on I(X; V), we use the lower bound $h_2(p) \ge 2\log 2 \cdot \min\{p, 1 - p\}$ to write
\[
\begin{aligned}
\frac{2}{\log 2} I(X; V) &\le 2 - 2\int (p_0(x) + p_1(x)) \min\Big\{\frac{p_0(x)}{p_0(x) + p_1(x)}, \frac{p_1(x)}{p_0(x) + p_1(x)}\Big\}\, d\mu(x) \\
&= 2 - 2\int \min\{p_0(x), p_1(x)\}\, d\mu(x) \\
&= 2\int (p_1(x) - \min\{p_0(x), p_1(x)\})\, d\mu(x) = 2\int_{p_1 > p_0} (p_1(x) - p_0(x))\, d\mu(x).
\end{aligned}
\]
But of course the final integral is $\|P_1 - P_0\|_{\mathrm{TV}}$, giving $I(X; V) \le \log 2 \cdot \|P_0 - P_1\|_{\mathrm{TV}}$. Conversely, for the lower bound on $D_{\mathrm{js}}(P_0\|P_1)$, we use the upper bound $h_2(p) \le 2\log 2 \cdot \sqrt{p(1 - p)}$ to obtain
\[
\frac{1}{\log 2} I(X; V) \ge 1 - \int (p_0 + p_1)\sqrt{\frac{p_0}{p_0 + p_1}\Big(1 - \frac{p_0}{p_0 + p_1}\Big)}\, d\mu = 1 - \int \sqrt{p_0 p_1}\, d\mu = \frac{1}{2}\int (\sqrt{p_0} - \sqrt{p_1})^2\, d\mu = d_{\mathrm{hel}}^2(P_0, P_1),
\]
as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have
\[
\begin{aligned}
D_{\mathrm{js}}(P_0\|P_1) &= \frac{1}{2} D_{\mathrm{kl}}(P_0\|(P_0 + P_1)/2) + \frac{1}{2} D_{\mathrm{kl}}(P_1\|(P_0 + P_1)/2) \\
&\le \frac{1}{2} D_{\chi^2}(P_0\|(P_0 + P_1)/2) + \frac{1}{2} D_{\chi^2}(P_1\|(P_0 + P_1)/2) \\
&= \frac{1}{2}\int \frac{(p_0 - p_1)^2}{p_0 + p_1}\, d\mu = \frac{1}{2}\int \frac{(\sqrt{p_0} - \sqrt{p_1})^2 (\sqrt{p_0} + \sqrt{p_1})^2}{p_0 + p_1}\, d\mu.
\end{aligned}
\]
Now note that $(a + b)^2 \le 2a^2 + 2b^2$ for any $a, b \in \mathbb{R}$, and so $(\sqrt{p_0} + \sqrt{p_1})^2 \le 2(p_0 + p_1)$, and thus the final integral has bound $\int (\sqrt{p_0} - \sqrt{p_1})^2\, d\mu = 2\, d_{\mathrm{hel}}^2(P_0, P_1)$.


2.2.5 Convexity and data processing for divergence measures


f -divergences satisfy a number of very useful properties, which we use repeatedly throughout the
lectures. As the KL-divergence is an f -divergence, it of course satisfies these conditions; however,
we state them in fuller generality, treating the KL-divergence results as special cases and corollaries.
We begin by exhibiting the general data processing properties and convexity properties of f -
divergences, each of which specializes to KL divergence. We leave the proof of each of these as
exercises. First, we show that f -divergences are jointly convex in their arguments.

Proposition 2.2.11. Let P1 , P2 , Q1 , Q2 be distributions on a set X and f : R+ → R be convex.


Then for any λ ∈ [0, 1],

Df (λP1 + (1 − λ)P2 ||λQ1 + (1 − λ)Q2 ) ≤ λDf (P1 ||Q1 ) + (1 − λ)Df (P2 ||Q2 ) .

The proof of this proposition we leave as Exercise 2.11, which we treat as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.

Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.

We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = φ(X) for some function φ, and more generally when Z = φ(X, U ) for a function φ and
some additional randomness U . For a distribution P on X, we then define the marginals
\[
K_P(A) := \int_{\mathcal{X}} K(A \mid x)\, dP(x).
\]

We then have the following proposition.

Proposition 2.2.13. Let P and Q be distributions on X and let K be any Markov kernel. Then

Df (KP ||KQ ) ≤ Df (P ||Q) .

See Exercise 2.10 for a proof.


As a corollary, we obtain the following data processing inequality for KL-divergences, where we
abuse notation to write Dkl (X||Y ) = Dkl (P ||Q) for random variables X ∼ P and Y ∼ Q.

Corollary 2.2.14. Let X, Y ∈ X be random variables, let U ∈ U be independent of X and Y , and


let φ : X × U → Z for some spaces X , U, Z. Then

Dkl (φ(X, U )||φ(Y, U )) ≤ Dkl (X||Y ) .

Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.


2.3 First steps into optimal procedures: testing inequalities


As noted in the introduction, a central benefit of the information theoretic tools we explore is that
they allow us to certify the optimality of procedures—that no other procedure could (substantially)
improve upon the one at hand. The main tools for these certifications are often inequalities gov-
erning the best possible behavior of a variety of statistical tests. Roughly, we put ourselves in the
following scenario: nature chooses one of a possible set of (say) k worlds, indexed by probability distributions $P_1, P_2, \ldots, P_k$, and conditional on nature's choice of the world—the distribution $P^\star \in \{P_1, \ldots, P_k\}$ chosen—we observe data X drawn from $P^\star$. Intuitively, it will be difficult to decide which distribution $P_i$ is the true $P^\star$ if all the distributions are similar—the divergence between the $P_i$ is small, or the information between X and $P^\star$ is negligible—and easy if the distances
first examples of their application, to make concrete these connections to the notions of information
and divergence defined in this section.

2.3.1 Le Cam’s inequality and binary hypothesis testing


The simplest instantiation of the above setting is the case when there are only two possible dis-
tributions, P1 and P2 , and our goal is to make a decision on whether P1 or P2 is the distribution
generating data we observe. Concretely, suppose that nature chooses one of the distributions P1
or P2 at random, and let V ∈ {1, 2} index this choice. Conditional on V = v, we then observe a
sample X drawn from Pv . Denoting by P the joint distribution of V and X, we have for any test
$\Psi : \mathcal{X} \to \{1, 2\}$ that the probability of error is then
\[
\mathbb{P}(\Psi(X) \neq V) = \frac{1}{2} P_1(\Psi(X) \neq 1) + \frac{1}{2} P_2(\Psi(X) \neq 2).
\]
We can give an exact expression for the minimal possible error in the above hypothesis test.
Indeed, a standard result of Le Cam (see [127, 177, Lemma 1]) is the following variational representa-
tion of the total variation distance (2.2.6), which is the f-divergence associated with $f(t) = \frac{1}{2}|t - 1|$,
as a function of testing error.

Proposition 2.3.1. Let X be an arbitrary set. For any distributions P1 and P2 on X , we have

\[
\inf_{\Psi}\big\{P_1(\Psi(X) \neq 1) + P_2(\Psi(X) \neq 2)\big\} = 1 - \|P_1 - P_2\|_{\mathrm{TV}},
\]

where the infimum is taken over all tests Ψ : X → {1, 2}.

Proof  Any test $\Psi : \mathcal{X} \to \{1, 2\}$ has an acceptance region, call it $A \subset \mathcal{X}$, where it outputs 1, and a region $A^c$ where it outputs 2. Then
\[
P_1(\Psi \neq 1) + P_2(\Psi \neq 2) = P_1(A^c) + P_2(A) = 1 - P_1(A) + P_2(A).
\]
Taking an infimum over such acceptance regions, we have
\[
\inf_{\Psi}\{P_1(\Psi \neq 1) + P_2(\Psi \neq 2)\} = \inf_{A \subset \mathcal{X}}\{1 - (P_1(A) - P_2(A))\} = 1 - \sup_{A \subset \mathcal{X}}(P_1(A) - P_2(A)),
\]

which yields the total variation distance as desired.


In the two-hypothesis case, we also know that the optimal test, by the Neyman–Pearson lemma, is a likelihood ratio test. That is, assuming that P1 and P2 have densities p1 and p2, the optimal test is of the form
\[
\Psi(X) = \begin{cases} 1 & \text{if } \frac{p_1(X)}{p_2(X)} \ge t \\ 2 & \text{if } \frac{p_1(X)}{p_2(X)} < t \end{cases}
\]
for some threshold $t \ge 0$. In the case that the prior probabilities on P1 and P2 are each $\frac{1}{2}$, then t = 1 is optimal.
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.

Example 2.3.2 (Testing a normal mean): Suppose we observe $X_1, \ldots, X_n \stackrel{\mathrm{iid}}{\sim} P$ for P = P1 or P = P2, where $P_v$ is the normal distribution $\mathsf{N}(\mu_v, \sigma^2)$, where $\mu_1 \neq \mu_2$. We would like to understand the sample size n necessary to guarantee that no test can have small error, that is, say, that
\[
\inf_{\Psi}\{P_1(\Psi(X_1, \ldots, X_n) \neq 1) + P_2(\Psi(X_1, \ldots, X_n) \neq 2)\} \ge \frac{1}{2}.
\]
By Proposition 2.3.1, we have that
\[
\inf_{\Psi}\{P_1(\Psi(X_1, \ldots, X_n) \neq 1) + P_2(\Psi(X_1, \ldots, X_n) \neq 2)\} \ge 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}},
\]
where $P_v^n$ denotes the n-fold product of $P_v$, that is, the distribution of $X_1, \ldots, X_n \stackrel{\mathrm{iid}}{\sim} P_v$.
The interaction between total variation distance and product distributions is somewhat subtle, so it is often advisable to use a divergence measure more attuned to the i.i.d. nature of the sampling scheme. Two such measures are the KL-divergence and Hellinger distance, both of which we explore in the coming chapters. With that in mind, we apply Pinsker's inequality (2.2.10) to see that $\|P_1^n - P_2^n\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P_1^n\|P_2^n) = \frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)$, which implies that
\[
1 - \|P_1^n - P_2^n\|_{\mathrm{TV}} \ge 1 - \Big(\frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)\Big)^{\frac{1}{2}} = 1 - \Big(\frac{n}{2} \cdot \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\Big)^{\frac{1}{2}} = 1 - \frac{\sqrt{n}\,|\mu_1 - \mu_2|}{2\sigma}.
\]
In particular, if $n \le \frac{\sigma^2}{(\mu_1 - \mu_2)^2}$, then we have our desired lower bound of $\frac{1}{2}$.
Conversely, a calculation yields that $n \ge \frac{C\sigma^2}{(\mu_1 - \mu_2)^2}$, for some numerical constant $C \ge 1$, implies small probability of error. We leave this calculation to the reader. 3
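A simulation complements the calculation. The sketch below (with hypothetical parameters µ1 = 0, µ2 = 0.5, σ = 1; here the mean-threshold rule is exactly the likelihood ratio test with t = 1) estimates the total error of the optimal test as n grows:

```python
# Monte Carlo estimate of P1(test != 1) + P2(test != 2) for the optimal
# mean-threshold test between N(mu1, 1) and N(mu2, 1).
import numpy as np

rng = np.random.default_rng(3)
mu1, mu2, sigma, trials = 0.0, 0.5, 1.0, 20000
thresh = (mu1 + mu2) / 2          # likelihood ratio test with t = 1
for n in [1, 4, 16, 64]:
    mean1 = rng.normal(mu1, sigma, size=(trials, n)).mean(axis=1)
    mean2 = rng.normal(mu2, sigma, size=(trials, n)).mean(axis=1)
    err = np.mean(mean1 >= thresh) + np.mean(mean2 < thresh)
    print(f"n = {n:3d}: total error ~ {err:.3f}")
```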

2.3.2 Fano’s inequality and multiple hypothesis testing


There are of course situations in which we do not wish to simply test two hypotheses, but have
multiple hypotheses present. In such situations, Fano’s inequality, which we present shortly, is
the most common tool for proving fundamental limits, lower bounds on probability of error, and
converses (to results on achievability of some performance level) in information theory. We write this section in terms of general random variables, ignoring the precise setting of selecting an index in a family of distributions, though that is implicit in what we do.
Let X be a random variable taking values in a finite set $\mathcal{X}$, and assume that we observe a (different) random variable Y, and then must estimate or guess the true value of X with some $\widehat{X}$. That is, we have the Markov chain
\[
X \to Y \to \widehat{X},
\]


and we wish to provide lower bounds on the probability of error—that is, on $\mathbb{P}(\widehat{X} \neq X)$. If we let the function $h_2(p) = -p\log p - (1-p)\log(1-p)$ denote the binary entropy (entropy of a Bernoulli random variable with parameter p), Fano's inequality takes the following form [e.g. 53, Chapter 2]:

Proposition 2.3.3 (Fano inequality). For any Markov chain $X \to Y \to \widehat{X}$, we have
\[
h_2(\mathbb{P}(\widehat{X} \neq X)) + \mathbb{P}(\widehat{X} \neq X)\log(|\mathcal{X}| - 1) \ge H(X \mid \widehat{X}). \tag{2.3.1}
\]

Proof  This proof follows by expanding an entropy functional in two different ways. Let E be the indicator for the event that $\widehat{X} \neq X$, that is, E = 1 if $\widehat{X} \neq X$ and 0 otherwise. Then we have
\[
H(X, E \mid \widehat{X}) = H(X \mid E, \widehat{X}) + H(E \mid \widehat{X}) = \mathbb{P}(E = 1) H(X \mid E = 1, \widehat{X}) + \mathbb{P}(E = 0)\underbrace{H(X \mid E = 0, \widehat{X})}_{=0} + H(E \mid \widehat{X}),
\]
where the zero follows because given that there is no error, X has no variability given $\widehat{X}$. Expanding the entropy by the chain rule in a different order, we have
\[
H(X, E \mid \widehat{X}) = H(X \mid \widehat{X}) + \underbrace{H(E \mid \widehat{X}, X)}_{=0},
\]
because E is perfectly predicted by $\widehat{X}$ and X. Combining these equalities, we have
\[
H(X \mid \widehat{X}) = H(X, E \mid \widehat{X}) = \mathbb{P}(E = 1) H(X \mid E = 1, \widehat{X}) + H(E \mid \widehat{X}).
\]
Noting that $H(E \mid \widehat{X}) \le H(E) = h_2(\mathbb{P}(E = 1))$, as conditioning reduces entropy, and that $H(X \mid E = 1, \widehat{X}) \le \log(|\mathcal{X}| - 1)$, as X can take on at most $|\mathcal{X}| - 1$ values when there is an error, completes the proof.

We can rewrite Proposition 2.3.3 in a convenient way when X is uniform in $\mathcal{X}$. Indeed, by definition of the mutual information, we have $I(X; \widehat{X}) = H(X) - H(X \mid \widehat{X})$, so Proposition 2.3.3 implies that in the canonical hypothesis testing problem above, we have

Corollary 2.3.4. Assume that X is uniform on $\mathcal{X}$. For any Markov chain $X \to Y \to \widehat{X}$,
\[
\mathbb{P}(\widehat{X} \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log(|\mathcal{X}|)}. \tag{2.3.2}
\]
Proof  Let $P_{\mathrm{error}} = \mathbb{P}(\widehat{X} \neq X)$ denote the probability of error. Noting that $h_2(p) \le \log 2$ for any $p \in [0, 1]$ (recall inequality (2.1.2), that is, that uniform random variables maximize entropy) and using Proposition 2.3.3, we have
\[
\log 2 + P_{\mathrm{error}} \log(|\mathcal{X}|) \ge h_2(P_{\mathrm{error}}) + P_{\mathrm{error}} \log(|\mathcal{X}| - 1) \stackrel{(i)}{\ge} H(X \mid \widehat{X}) \stackrel{(ii)}{=} H(X) - I(X; \widehat{X}).
\]
Here step (i) uses Proposition 2.3.3 and step (ii) uses the definition of mutual information, that $I(X; \widehat{X}) = H(X) - H(X \mid \widehat{X})$. The data processing inequality implies that $I(X; \widehat{X}) \le I(X; Y)$, and using $H(X) = \log(|\mathcal{X}|)$ completes the proof.


In particular, Corollary 2.3.4 shows that when X is chosen uniformly at random and we observe Y, we have
\[
\inf_{\Psi} \mathbb{P}(\Psi(Y) \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log|\mathcal{X}|},
\]
where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly chosen from $\mathcal{X}$, then we expect that $\log_2|\mathcal{X}|$ bits are necessary (and sufficient). Thus, until we collect enough information that $I(X; Y) \approx \log|\mathcal{X}|$, so that $I(X; Y)/\log|\mathcal{X}| \approx 1$, we are unlikely to be able to identify the variable X with any substantial probability. So we must collect enough bits to actually discover X.
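Concretely, the bound (2.3.2) decays linearly in the fraction of the log |X| "bits" collected; the toy computation below (|X| = 1024 is an arbitrary choice) tabulates the resulting error floor:

```python
# The Fano lower bound (2.3.2) as a function of the information collected:
# until I(X; Y) approaches log|X|, no test identifies X reliably.
import numpy as np

m = 1024
log_m = np.log(m)
for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:
    info = frac * log_m
    bound = 1 - (info + np.log(2)) / log_m
    print(f"I(X;Y) = {frac:.2f} log|X|  =>  P(error) >= {max(bound, 0):.3f}")
```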
Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most $\lceil \log_2 m\rceil$ questions are necessary to guess the item X that the chooser has picked—at each round of the game, the guesser asks a question that eliminates half of the remaining possible items. Indeed, let us assume that $m = 2^l$ for some $l \in \mathbb{N}$; if not, the guesser can always make her task more difficult by increasing the size of $\mathcal{X}$ until it is a power of 2. Thus, after k rounds, there are $m 2^{-k}$ items left, and we have
\[
m\Big(\frac{1}{2}\Big)^k \le 1 \quad \text{if and only if} \quad k \ge \log_2 m.
\]
For the converse—the chooser’s strategy—let Y1 , Y2 , . . . , Yk be the sequence of yes/no answers
given to the guesser. Assume that the chooser picks X uniformly at random in X . Then Fano’s
inequality (2.3.2) implies that for the guess $\widehat{X}$ the guesser makes,
\[
\mathbb{P}(\widehat{X} \neq X) \ge 1 - \frac{I(X; Y_1, \ldots, Y_k) + \log 2}{\log m}.
\]
By the chain rule for mutual information, we have
\[
I(X; Y_1, \ldots, Y_k) = \sum_{i=1}^k I(X; Y_i \mid Y_{1:i-1}) = \sum_{i=1}^k \big[H(Y_i \mid Y_{1:i-1}) - H(Y_i \mid Y_{1:i-1}, X)\big] \le \sum_{i=1}^k H(Y_i).
\]

As the answers Yi are yes/no, we have H(Yi ) ≤ log 2, so that I(X; Y1:k ) ≤ k log 2. Thus we
find
P(Xb 6= X) ≥ 1 − (k + 1) log 2 = log2 m − 1 − k ,
log m log2 m log2 m
so that we the guesser must have k ≥ log2 (m/2) to be guaranteed that she will make no
mistakes. 3
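The guesser's halving strategy is just binary search, which we can state concretely (a minimal sketch; the universe {0, . . . , m−1} and the "is x < mid?" question format are illustrative choices):

```python
# Binary-search guesser: finds any x in a universe of size m = 2^l with
# exactly log2(m) yes/no questions, matching the Fano lower bound to
# within one question.
def guess(answer_yes, m):
    """Find x in {0, ..., m-1} by asking 'is x < mid?' questions."""
    lo, hi, questions = 0, m, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if answer_yes(mid):   # chooser answers: is x < mid?
            hi = mid
        else:
            lo = mid
    return lo, questions

x, m = 733, 1024
found, k = guess(lambda mid: x < mid, m)
assert found == x and k == 10    # log2(1024) = 10 questions
```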

2.4 A first operational result: entropy and source coding


The final section of this chapter explores the basic results in source coding. Source coding—in its simplest form—tells us precisely the number of bits (or some other form of information storage) that are necessary to perfectly encode a sequence of random variables $X_1, X_2, \ldots$ drawn according to a known distribution P.


2.4.1 The source coding problem


Assume we receive data consisting of a sequence of symbols X1 , X2 , . . ., drawn from a known
distribution P on a finite or countable space X . We wish to choose an encoding, represented by a
d-ary code function C that maps X to finite strings consisting of the symbols {0, 1, . . . , d − 1}. We
denote this by C : X → {0, 1, . . . , d − 1}∗ , where the superscript ∗ denotes the length may change
from input to input, and use `C (x) to denote the length of the string C(x).
In general, we will consider a variety of types of codes; we define each in order of complexity of
their decoding.
Definition 2.1. A d-ary code $C : \mathcal{X} \to \{0, \ldots, d-1\}^*$ is non-singular if for each $x, x' \in \mathcal{X}$ we have $C(x) \neq C(x')$ if $x \neq x'$.
While Definition 2.1 is natural, generally speaking, we wish to transmit or encode a variety of code-
words simultaneously, that is, we wish to encode a sequence X1 , X2 , . . . using the natural extension
of the code C as the string C(X1 )C(X2 )C(X3 ) · · · , where C(x1 )C(x2 ) denotes the concatenation of
the strings C(x1 ) and C(x2 ). In this case, we require that the code be uniquely decodable:
Definition 2.2. A d-ary code $C : \mathcal{X} \to \{0, \ldots, d-1\}^*$ is uniquely decodable if for all sequences $x_1, \ldots, x_n \in \mathcal{X}$ and $x'_1, \ldots, x'_n \in \mathcal{X}$ we have
$C(x_1)C(x_2)\cdots C(x_n) = C(x'_1)C(x'_2)\cdots C(x'_n)$ if and only if $x_1 = x'_1, \ldots, x_n = x'_n$.
That is, the extension of the code C to sequences is non-singular.
While more useful (generally) than simply non-singular codes, uniquely decodable codes may require
inspection of an entire string before recovering the first element. With that in mind, we now consider
the easiest to use codes, which can always be decoded instantaneously.
Definition 2.3. A d-ary code $C : \mathcal{X} \to \{0, \ldots, d-1\}^*$ is a prefix code, or instantaneous, if no codeword is the prefix of another codeword.
As is hopefully apparent from the definitions, all prefix/instantaneous codes are uniquely decodable,
which are in turn non-singular. The converse is not true, though we will see a sense in which—as
long as we care only about encoding sequences—using prefix instead of uniquely decodable codes
has negligible consequences.
For example, written English, with periods (.) and spaces ( ) included at the ends of words
(among other punctuation) is an instantaneous encoding of English into the symbols of the alphabet
and punctuation, as punctuation symbols enforce that no “codeword” is a prefix of any other. A
few more concrete examples may make things more clear.
Example 2.4.1 (Encoding strategies): Consider the encoding schemes below, which encode
the letters a, b, c, and d.
Symbol C1 (x) C2 (x) C3 (x)
a 0 00 0
b 00 10 10
c 000 11 110
d 0000 110 111
By inspection, it is clear that C1 is non-singular but certainly not uniquely decodable (does
the sequence 0000 correspond to aaaa, bb, aab, aba, baa, ca, ac, or d?), while C3 is a prefix
code. We leave showing that C2 is uniquely decodable as an exercise. 3
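Checking the prefix condition mechanically is straightforward; the sketch below (an illustrative helper, not from the text) confirms that C3 is a prefix code while C1 is not, and also shows that C2 fails the prefix test even though it is uniquely decodable:

```python
# Prefix-condition checker for the codes of Example 2.4.1.
def is_prefix_code(code):
    """True if no codeword is a proper prefix of another."""
    words = list(code.values())
    return not any(w1 != w2 and w2.startswith(w1)
                   for w1 in words for w2 in words)

C1 = {"a": "0", "b": "00", "c": "000", "d": "0000"}
C2 = {"a": "00", "b": "10", "c": "11", "d": "110"}
C3 = {"a": "0", "b": "10", "c": "110", "d": "111"}
assert not is_prefix_code(C1)   # not even uniquely decodable
assert not is_prefix_code(C2)   # "11" prefixes "110", yet uniquely decodable
assert is_prefix_code(C3)       # instantaneous
```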


2.4.2 The Kraft-McMillan inequalities


We now turn to a few results on the connections between source-coding and entropy. Our first
result, the Kraft-McMillan inequality, is an essential result that—as we shall see–essentially says
that there is no difference in code-lengths attainable by prefix codes and uniquely decodable codes.

[Figure: a d-ary (here ternary) prefix tree; symbols sit at leaves so that no codeword is a prefix of another.]
Figure 2.1. Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for x3 is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.

Theorem 2.4.2. Let $\mathcal{X}$ be a finite or countable set, and let $\ell : \mathcal{X} \to \mathbb{N}$ be a function. If $\ell(x)$ is the length of the encoding of the symbol x in a uniquely decodable d-ary code, then
\[
\sum_{x \in \mathcal{X}} d^{-\ell(x)} \le 1. \tag{2.4.1}
\]
Conversely, given any function $\ell : \mathcal{X} \to \mathbb{N}$ satisfying inequality (2.4.1), there is a prefix code whose codewords have length $\ell(x)$ for each $x \in \mathcal{X}$.
Proof  We prove the first statement of the theorem by a counting and asymptotic argument.
We begin by assuming that $\mathcal{X}$ is finite; we eliminate this assumption subsequently. As a consequence, there is some maximum length $\ell_{\max}$ such that $\ell(x) \le \ell_{\max}$ for all $x \in \mathcal{X}$. For a sequence $x_1, \ldots, x_n \in \mathcal{X}$, we have by the definition of our encoding strategy that $\ell(x_1, \ldots, x_n) = \sum_{i=1}^n \ell(x_i)$. In addition, for each m we let
\[
E_n(m) := \{x_{1:n} \in \mathcal{X}^n \text{ such that } \ell(x_{1:n}) = m\}
\]
denote the set of sequences encoded with codewords of length m in our code; as the code is uniquely decodable we certainly have $\mathrm{card}(E_n(m)) \le d^m$ for all n and m. Moreover, for all $x_{1:n} \in \mathcal{X}^n$ we have $\ell(x_{1:n}) \le n\ell_{\max}$. We thus re-index the sum $\sum_x d^{-\ell(x)}$ and compute
\[
\sum_{x_1, \ldots, x_n \in \mathcal{X}^n} d^{-\ell(x_1, \ldots, x_n)} = \sum_{m=1}^{n\ell_{\max}} \mathrm{card}(E_n(m))\, d^{-m} \le \sum_{m=1}^{n\ell_{\max}} d^{m-m} = n\ell_{\max}.
\]


The preceding relation is true for all $n \in \mathbb{N}$, so that
\[
\Big(\sum_{x_{1:n} \in \mathcal{X}^n} d^{-\ell(x_{1:n})}\Big)^{1/n} \le n^{1/n}\, \ell_{\max}^{1/n} \to 1
\]
as $n \to \infty$. In particular, using that
\[
\sum_{x_{1:n} \in \mathcal{X}^n} d^{-\ell(x_{1:n})} = \sum_{x_1, \ldots, x_n \in \mathcal{X}^n} d^{-\ell(x_1)} \cdots d^{-\ell(x_n)} = \Big(\sum_{x \in \mathcal{X}} d^{-\ell(x)}\Big)^n,
\]
we obtain $\sum_{x \in \mathcal{X}} d^{-\ell(x)} \le 1$.


Returning to the case that $\mathrm{card}(\mathcal{X}) = \infty$, define the sequence
\[
D_k := \sum_{x \in \mathcal{X},\, \ell(x) \le k} d^{-\ell(x)}.
\]
As the code restricted to each subset $\{x \in \mathcal{X} : \ell(x) \le k\}$ is uniquely decodable, we have $D_k \le 1$ for all k. Then $1 \ge \lim_{k \to \infty} D_k = \sum_{x \in \mathcal{X}} d^{-\ell(x)}$.
The achievability of such a code is straightforward by a pictorial argument (recall Figure 2.1),
so we sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that `(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
JCD Comment: Fill out this proof, potentially deferring it.
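To complement the sketched construction, here is a concrete binary (d = 2) version of the converse: given lengths satisfying (2.4.1), assigning each symbol the first $\ell_i$ bits of the binary expansion of the running sum $\sum_{j<i} 2^{-\ell_j}$ (symbols sorted by length) yields a prefix code. This is an illustrative sketch, not the text's proof:

```python
# Construct a binary prefix code from lengths satisfying Kraft's inequality.
from fractions import Fraction

def kraft_prefix_code(lengths):
    assert sum(Fraction(1, 2 ** l) for l in lengths) <= 1, "Kraft violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords, total = [None] * len(lengths), Fraction(0)
    for i in order:
        l = lengths[i]
        num = total * 2 ** l          # an integer, by the sorted order
        codewords[i] = format(int(num), "b").zfill(l)
        total += Fraction(1, 2 ** l)
    return codewords

print(kraft_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```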

With the Kraft-McMillan theorem in place, we may directly relate the entropy of a random variable to the length of possible encodings for the variable; in particular, we show that the entropy is essentially the best possible code length of a uniquely decodable source code. In this theorem, we use the shorthand
\[
H_d(X) := -\sum_{x \in \mathcal{X}} p(x)\log_d p(x).
\]

Theorem 2.4.3. Let $X \in \mathcal{X}$ be a discrete random variable distributed according to P and let $\ell_C$ be the length function associated with a d-ary encoding $C : \mathcal{X} \to \{0, \ldots, d-1\}^*$. In addition, let $\mathcal{C}$ be the set of all uniquely decodable d-ary codes for $\mathcal{X}$. Then
\[
H_d(X) \le \inf\{\mathbb{E}_P[\ell_C(X)] : C \in \mathcal{C}\} \le H_d(X) + 1.
\]

Proof  The lower bound is an argument by convex optimization, while for the upper bound we give an explicit length function and (implicit) prefix code attaining the bound. For the lower bound, we assume for simplicity that $\mathcal{X}$ is finite, and we identify $\mathcal{X} = \{1, \ldots, |\mathcal{X}|\}$ (let $m = |\mathcal{X}|$ for shorthand). Then as $\mathcal{C}$ consists of uniquely decodable codebooks, all the associated length functions must satisfy the Kraft-McMillan inequality (2.4.1). Letting $\ell_i = \ell(i)$, the minimal encoding length is at least
\[
\inf_{\ell \in \mathbb{R}^m}\Big\{\sum_{i=1}^m p_i \ell_i : \sum_{i=1}^m d^{-\ell_i} \le 1\Big\}.
\]


By introducing the Lagrange multiplier $\lambda \ge 0$ for the inequality constraint, we may write the Lagrangian for the preceding minimization problem as
\[
L(\ell, \lambda) = p^\top \ell + \lambda\Big(\sum_{i=1}^m d^{-\ell_i} - 1\Big) \quad \text{with} \quad \nabla_\ell L(\ell, \lambda) = p - \lambda\big[d^{-\ell_i}\log d\big]_{i=1}^m.
\]
In particular, the optimal $\ell$ satisfies $\ell_i = \log_d\frac{\theta}{p_i}$ for some constant θ, and solving $\sum_{i=1}^m d^{-\log_d\frac{\theta}{p_i}} = 1$ gives $\theta = 1$ and $\ell(i) = \log_d\frac{1}{p_i}$.
To attain the result, simply set our encoding to be $\ell(x) = \lceil \log_d\frac{1}{P(X = x)}\rceil$, which satisfies the Kraft-McMillan inequality and thus yields a valid prefix code with
\[
\mathbb{E}_P[\ell(X)] = \sum_{x \in \mathcal{X}} p(x)\Big\lceil \log_d\frac{1}{p(x)}\Big\rceil \le -\sum_{x \in \mathcal{X}} p(x)\log_d p(x) + 1 = H_d(X) + 1
\]

as desired.

Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.
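For a dyadic distribution the Shannon lengths $\ell(x) = \lceil \log_2(1/p(x))\rceil$ achieve the entropy exactly; the sketch below (an arbitrary dyadic p.m.f., d = 2) checks the Kraft inequality and the two-sided bound of Theorem 2.4.3:

```python
# Shannon code lengths l(x) = ceil(log2(1/p(x))) satisfy Kraft and give
# H(X) <= E[l(X)] <= H(X) + 1; equality holds for dyadic p.
import math

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
H = -sum(pi * math.log2(pi) for pi in p)
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]
assert sum(2.0 ** -l for l in lengths) <= 1          # Kraft (2.4.1)
expected = sum(pi * l for pi, l in zip(p, lengths))
print(f"H2(X) = {H:.3f}, E[l(X)] = {expected:.3f}")
assert H <= expected <= H + 1
```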

2.4.3 Entropy rates and longer codes


Theorem 2.4.3 is a bit unsatisfying in that the additive constant 1 may be quite large relative to the entropy. By encoding longer sequences, we can (asymptotically) eliminate this error factor. To that end, we here show that it is possible, at least for appropriate distributions on random variables $X_i$, to achieve a per-symbol encoding length that approaches a limiting version of the Shannon entropy of a random variable. We give two definitions capturing the limiting entropy properties of sequences of random variables.

Definition 2.4. The entropy rate of a sequence $X_1, X_2, \ldots$ of random variables is
\[
H(\{X_i\}) := \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) \tag{2.4.2}
\]
whenever the limit exists.

In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in
which it does, and we focus generally on a specific but common instance in which the limit does
exist. First, we recall the definition of a stationary sequence of random variables.

Definition 2.5. We say a sequence $X_1, X_2, \ldots$ of random variables is stationary if for all n and all $k \in \mathbb{N}$ and all measurable sets $A_1, \ldots, A_k \subset \mathcal{X}$ we have

P(X1 ∈ A1 , . . . , Xk ∈ Ak ) = P(Xn+1 ∈ A1 , . . . , Xn+k ∈ Ak ).

With this definition, we have the following result.


Proposition 2.4.4. Let the sequence of random variables $\{X_i\}$, taking values in the discrete space $\mathcal{X}$, be stationary. Then
\[
H(\{X_i\}) = \lim_{n \to \infty} H(X_n \mid X_1, \ldots, X_{n-1}),
\]
and both this limit and the limit (2.4.2) exist.


Proof  We begin by making the following standard observation about Cesàro means: if $c_n = \frac{1}{n}\sum_{i=1}^n a_i$ and $a_i \to a$, then $c_n \to a$.³ Now, we note that for a stationary sequence, we have
\[
H(X_n \mid X_{1:n-1}) = H(X_{n+1} \mid X_{2:n}),
\]
and using that conditioning decreases entropy, we have
\[
H(X_{n+1} \mid X_{1:n}) \le H(X_n \mid X_{1:n-1}).
\]
Thus the sequence $a_n := H(X_n \mid X_{1:n-1})$ is non-increasing and bounded below by 0, so that it has some limit $\lim_{n \to \infty} H(X_n \mid X_{1:n-1})$. As $H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_{1:i-1})$ by the chain rule for entropy, we achieve the result of the proposition.

Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable $X_i$. To do so, we require the use of a block code, which (while it may be a prefix code) treats sets of random variables $(X_1, \ldots, X_m) \in \mathcal{X}^m$ as a single symbol to be jointly encoded.

Proposition 2.4.5. Let the sequence of random variables $X_1, X_2, \ldots$ be stationary. Then for any $\epsilon > 0$, there exists an $m \in \mathbb{N}$ and a d-ary (prefix) block encoder $C : \mathcal{X}^m \to \{0, \ldots, d-1\}^*$ such that
\[
\lim_n \frac{1}{n} \mathbb{E}_P[\ell_C(X_{1:n})] \le H(\{X_i\}) + \epsilon = \lim_n H(X_n \mid X_1, \ldots, X_{n-1}) + \epsilon.
\]

Proof  Let $C : \mathcal{X}^m \to \{0, 1, \ldots, d-1\}^*$ be any prefix code with
\[
\ell_C(x_{1:m}) \le \Big\lceil \log_d \frac{1}{P(X_{1:m} = x_{1:m})}\Big\rceil.
\]
Then whenever n/m is an integer, we have
\[
\mathbb{E}_P[\ell_C(X_{1:n})] = \sum_{i=0}^{n/m - 1} \mathbb{E}_P\big[\ell_C(X_{mi+1}, \ldots, X_{m(i+1)})\big] \le \sum_{i=0}^{n/m - 1} \big[H(X_{mi+1}, \ldots, X_{m(i+1)}) + 1\big] = \frac{n}{m} + \frac{n}{m} H(X_1, \ldots, X_m).
\]
Dividing by n gives the result by taking m suitably large that $\frac{1}{m} + \frac{1}{m} H(X_1, \ldots, X_m) \le \epsilon + H(\{X_i\})$.
3. Indeed, let $\epsilon > 0$ and take N such that $n \ge N$ implies that $|a_i - a| < \epsilon$. Then for $n \ge N$, we have
\[
c_n - a = \frac{1}{n}\sum_{i=1}^n (a_i - a) = \frac{N(c_N - a)}{n} + \frac{1}{n}\sum_{i=N+1}^n (a_i - a) \in \frac{N(c_N - a)}{n} \pm \epsilon.
\]
Taking $n \to \infty$ yields that the term $N(c_N - a)/n \to 0$, which gives that $c_n - a \in [-\epsilon, \epsilon]$ eventually for any $\epsilon > 0$, which is our desired result.


Note that if m does not divide n, we may also encode the length of the sequence of encoded words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols, while if it begins with a 1, then the next $\lceil \log_d m\rceil$ bits encode the length of the block. This yields an increase in the expected length of the code to
\[
\mathbb{E}_P[\ell_C(X_{1:n})] \le \frac{2n + \lceil \log_d m\rceil}{m} + \frac{n}{m} H(X_1, \ldots, X_m).
\]
Dividing by n and letting $n \to \infty$ gives the result, as we can always choose m large.

2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an es-
sentially complete treatment in Chapter 2 of their book [53]. Gray [94] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chapter 7 in the newer edition of the same book). Csiszár and Körner [55] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [4] and Csiszár [54], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [131]
provide a survey of f -divergences and their relationships with different statistical concepts (taking
a Bayesian point of view), and various authors have extended the pairwise divergence measures to
divergence measures between multiple distributions [98], making connections to experimental design
and classification [89, 70], which we investigate later in the book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [127, 167]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [70, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [132].

2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X . The total variation distance between P1 and P2 is defined as

\[
\|P_1 - P_2\|_{\mathrm{TV}} := \sup_{A \subset \mathcal{X}} |P_1(A) - P_2(A)|.
\]

Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
(a) $2\|P_1 - P_2\|_{\mathrm{TV}} = \int |p_1(x) - p_2(x)|\, dx$.

(b) For functions $f : \mathcal{X} \to \mathbb{R}$, define the supremum norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. Show that $2\|P_1 - P_2\|_{\mathrm{TV}} = \sup_{\|f\|_\infty \le 1} \int_{\mathcal{X}} f(x)(p_1(x) - p_2(x))\, dx$.

(c) $\|P_1 - P_2\|_{\mathrm{TV}} = \int \max\{p_1(x), p_2(x)\}\, dx - 1$.

(d) $\|P_1 - P_2\|_{\mathrm{TV}} = 1 - \int \min\{p_1(x), p_2(x)\}\, dx$.

(e) For functions $f, g : \mathcal{X} \to \mathbb{R}$,
\[
\inf\Big\{\int f(x) p_1(x)\, dx + \int g(x) p_2(x)\, dx : f + g \ge 1,\, f \ge 0,\, g \ge 0\Big\} = 1 - \|P_1 - P_2\|_{\mathrm{TV}}.
\]

Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be $\mathsf{N}(\theta_1, \Sigma)$ and P2 be $\mathsf{N}(\theta_2, \Sigma)$, where $\Sigma \succ 0$ is a positive definite matrix. What is $D_{\mathrm{kl}}(P_1\|P_2)$?
Exercise 2.3 (The optimal test between distributions): Prove Le Cam's inequality: for any function ψ with $\mathrm{dom}\,\psi \supset \mathcal{X}$ and any distributions P1, P2,
\[
P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2) \ge 1 - \|P_1 - P_2\|_{\mathrm{TV}}.
\]
Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample X we must decide whether P1 or P2 is more likely, has value at least $1 - \|P_1 - P_2\|_{\mathrm{TV}}$. Given P1 and P2, is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density $p(x) = \frac{\lambda}{2}\exp(-\lambda|x - \mu|)$. Consider the hypothesis test of P1 versus P2, where X has distribution Laplace(λ, µ1) under P1 and distribution Laplace(λ, µ2) under P2, where $\mu_1 < \mu_2$. Show that the minimal value over all tests ψ of P1 versus P2 is
\[
\inf_\psi\big\{P_1(\psi(X) \neq 1) + P_2(\psi(X) \neq 2)\big\} = \exp\Big(-\frac{\lambda}{2}|\mu_1 - \mu_2|\Big).
\]

Exercise 2.5 (Log-sum inequality): Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be non-negative reals. Show that
\[
\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big)\log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}.
\]
(Hint: use the convexity of the function $x \mapsto -\log(x)$.)


Exercise 2.6: Given quantizers $g_1$ and $g_2$, we say that $g_1$ is a finer quantizer than $g_2$ under the following condition: assume that $g_1$ induces the partition $A_1, \ldots, A_n$ and $g_2$ induces the partition $B_1, \ldots, B_m$; then for any of the sets $B_i$, there exist some k and sets $A_{i_1}, \ldots, A_{i_k}$ such that $B_i = \cup_{j=1}^k A_{i_j}$. We let $g_1 \prec g_2$ denote that $g_1$ is a finer quantizer than $g_2$. Prove

(a) Finer partitions increase the KL divergence: if $g_1 \prec g_2$,
\[
D_{\mathrm{kl}}(P\|Q \mid g_2) \le D_{\mathrm{kl}}(P\|Q \mid g_1).
\]

(b) If $\mathcal{X}$ is discrete (so P and Q have p.m.f.s p and q) then
\[
D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}.
\]

Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:


(a) If f (t) = |t − 1|, then Df (P ||Q) = 2 kP − QkTV .

(b) If f (t) = t log t, then Df (P ||Q) = Dkl (P ||Q).

(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).

(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)

Exercise 2.8 (Generalized "log-sum" inequalities): Let $f : \mathbb{R}_+ \to \mathbb{R}$ be an arbitrary convex function.

(a) Let $a_i, b_i$, $i = 1, \ldots, n$ be non-negative reals. Prove that
\[
\Big(\sum_{i=1}^n a_i\Big) f\Big(\frac{\sum_{i=1}^n b_i}{\sum_{i=1}^n a_i}\Big) \le \sum_{i=1}^n a_i f\Big(\frac{b_i}{a_i}\Big).
\]

(b) Generalizing the preceding result, let $a : \mathcal{X} \to \mathbb{R}_+$ and $b : \mathcal{X} \to \mathbb{R}_+$, and let µ be a finite measure on $\mathcal{X}$ with respect to which a is integrable. Show that
\[
\Big(\int a(x)\, d\mu(x)\Big) f\Big(\frac{\int b(x)\, d\mu(x)}{\int a(x)\, d\mu(x)}\Big) \le \int a(x) f\Big(\frac{b(x)}{a(x)}\Big)\, d\mu(x).
\]
If you are unfamiliar with measure theory, prove the following essentially equivalent result: let $u : \mathcal{X} \to \mathbb{R}_+$ satisfy $\int u(x)\, dx < \infty$. Show that
\[
\Big(\int a(x)u(x)\, dx\Big) f\Big(\frac{\int b(x)u(x)\, dx}{\int a(x)u(x)\, dx}\Big) \le \int a(x) f\Big(\frac{b(x)}{a(x)}\Big) u(x)\, dx
\]
whenever $\int a(x)u(x)\, dx < \infty$. (It is possible to demonstrate this remains true under appropriate limits even when $\int a(x)u(x)\, dx = +\infty$, but it is a mess.)

(Hint: use the fact that the perspective of a function f, defined by $h(x, t) = t f(x/t)$ for $t > 0$, is jointly convex in x and t; see Proposition B.3.12.)
Exercise 2.9 (Data processing and f-divergences I): As with the KL-divergence, given a quantizer g of the set $\mathcal{X}$, where g induces a partition $A_1, \ldots, A_m$ of $\mathcal{X}$, we define the f-divergence between P and Q conditioned on g as
\[
D_f(P\|Q \mid g) := \sum_{i=1}^m Q(A_i) f\Big(\frac{P(A_i)}{Q(A_i)}\Big) = \sum_{i=1}^m Q(g^{-1}(\{i\})) f\Big(\frac{P(g^{-1}(\{i\}))}{Q(g^{-1}(\{i\}))}\Big).
\]
Given quantizers $g_1$ and $g_2$, we say that $g_1$ is a finer quantizer than $g_2$ under the following condition: assume that $g_1$ induces the partition $A_1, \ldots, A_n$ and $g_2$ induces the partition $B_1, \ldots, B_m$; then for any of the sets $B_i$, there exist some k and sets $A_{i_1}, \ldots, A_{i_k}$ such that $B_i = \cup_{j=1}^k A_{i_j}$. We let $g_1 \prec g_2$ denote that $g_1$ is a finer quantizer than $g_2$.

(a) Let $g_1$ and $g_2$ be quantizers of the set $\mathcal{X}$, and let $g_1 \prec g_2$, meaning that $g_1$ is a finer quantization than $g_2$. Prove that
\[
D_f(P\|Q \mid g_2) \le D_f(P\|Q \mid g_1).
\]
Equivalently, show that whenever $\mathcal{A}$ and $\mathcal{B}$ are collections of sets partitioning $\mathcal{X}$, but $\mathcal{A}$ is a finer partition of $\mathcal{X}$ than $\mathcal{B}$, that
\[
\sum_{B \in \mathcal{B}} Q(B) f\Big(\frac{P(B)}{Q(B)}\Big) \le \sum_{A \in \mathcal{A}} Q(A) f\Big(\frac{P(A)}{Q(A)}\Big).
\]
(Hint: Use the result of Question 2.8(a).)

(b) Suppose that $\mathcal{X}$ is countable (or finite) so that P and Q have p.m.f.s p and q. Show that
\[
D_f(P\|Q) = \sum_x q(x) f\Big(\frac{p(x)}{q(x)}\Big),
\]
where on the left we are using the partition definition (2.2.3); you should show that the partition into discrete parts of $\mathcal{X}$ achieves the supremum. You may assume that $\mathcal{X}$ is finite. (Though feel free to prove the result in the case that $\mathcal{X}$ is infinite.)

Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
(a) Define the marginals KP (A) = ∫ K(A, x)p(x)dx and KQ (A) = ∫ K(A, x)q(x)dx. Show that

Df (KP ||KQ ) ≤ Df (P ||Q) .

Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.

(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as

If (X; Y ) := Df (PXY ||PX × PY ) .

Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).

Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:

    Dkl (P ||Q) = sup_{f : EQ[e^{f(X)}] < ∞} {EP [f (X)] − log EQ [exp(f (X))]}.

You may assume that P and Q have densities.
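As a small illustration (not part of the exercise), the following Python sketch checks the Donsker-Varadhan representation by Monte Carlo for P = N(0, 1) and Q = N(1, 1), where Dkl (P ||Q) = 1/2; the sample size and the suboptimal test function are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    xp = rng.normal(0.0, 1.0, n)  # samples from P = N(0, 1)
    xq = rng.normal(1.0, 1.0, n)  # samples from Q = N(1, 1)

    def dv(f):
        # Monte Carlo estimate of E_P[f(X)] - log E_Q[exp(f(X))]
        return f(xp).mean() - np.log(np.mean(np.exp(f(xq))))

    f_opt = lambda x: 0.5 * ((x - 1.0) ** 2 - x ** 2)  # f = log dP/dQ attains the sup
    print(dv(f_opt))               # approximately 0.5 = Dkl(P||Q)
    print(dv(lambda x: 0.3 * x))   # any other f gives a smaller value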


Exercise 2.13: Let P and Q have densities p and q with respect to the base measure µ over the
set X . (Recall that this is no loss of generality, as we may take µ = P + Q.) Define the support
supp P := {x ∈ X : p(x) > 0}. Show that
    Dkl (P ||Q) ≥ log(1 / Q(supp P )).

Exercise 2.14: Let P1 be N(θ1 , Σ1 ) and P2 be N(θ2 , Σ2 ), where Σi ≻ 0 are positive definite
matrices. Give Dkl (P1 ||P2 ).
Exercise 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then
(a) I(X; V ) = ∫ Dkl (Pv ||P̄ ) dµ(v), where P̄ = ∫ Pv dµ(v) is the (weighted) average of the Pv . You
may assume that V is discrete if you like.

(b) For any distribution Q on X , I(X; V ) = ∫ Dkl (Pv ||Q) dµ(v) − Dkl (P̄ ||Q). Conclude that
I(X; V ) ≤ ∫ Dkl (Pv ||Q) dµ(v), or, equivalently, P̄ minimizes ∫ Dkl (Pv ||Q) dµ(v) over all prob-
abilities Q.

Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions
on X1^n = (X1 , . . . , Xn ) ∈ X^n , and let Pi (· | x1^{i−1}) be the conditional distribution of Xi given
X1^{i−1} = x1^{i−1} (and similarly for Qi ). Show that

    ‖P − Q‖TV ≤ ∑_{i=1}^n EP [‖Pi (· | X1^{i−1}) − Qi (· | X1^{i−1})‖TV ],

where the expectation is taken over X1^{i−1} distributed according to P .


Exercise 2.17: Let h(p) = −p log p − (1 − p) log(1 − p). Show that h(p) ≥ 2 log 2 · min{p, 1 − p}.
Exercise 2.18 (Lin [132], Theorem 8): Let h(p) = −p log p − (1 − p) log(1 − p). Show that
h(p) ≤ 2 log 2 · √(p(1 − p)).
Exercise 2.19 (Proving Pinsker’s inequality via data processing): We work through a proof of
Proposition 2.2.8.(a) using the data processing inequality for f -divergences (Proposition 2.2.13).

(a) Define Dkl (p||q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)). Argue that to prove Pinsker’s
inequality (2.2.10), it is enough to show that (p − q)² ≤ (1/2) Dkl (p||q).

(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that

    h(p) ≥ h(q) + h′(q)(p − q) + 2(p − q)²

for any p, q ∈ [0, 1].

(c) Conclude Pinsker’s inequality (2.2.10).

JCD Comment: Below are a few potential questions


Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik Endres
and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ Pv , then √(I(X; V ))
is a metric on distributions. (Said differently, Djs (P ||Q)^{1/2} is a metric on distributions, and it
generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.

Chapter 3

Exponential families and statistical modeling

Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.

3.1 Exponential family models


We begin by defining exponential family distributions, giving several examples to illustrate a few
of their properties. There are three key objects when defining a d-dimensional exponential family
distribution on an underlying space X : the sufficient statistic φ : X → Rd representing what we
model, a canonical parameter vector θ ∈ Rd , and a carrier h : X → R+ .
In the discrete case, where X is a discrete set, the exponential family associated with the
sufficient statistic φ and carrier h has probability mass function
    pθ (x) = h(x) exp (⟨θ, φ(x)⟩ − A(θ)) ,

where A is the log-partition-function, sometimes called the cumulant generating function, with

    A(θ) := log ∑_{x∈X} h(x) exp(⟨θ, φ(x)⟩).

In the continuous case, pθ is instead a density on X ⊂ Rk , and pθ takes the identical form above
but

    A(θ) = log ∫_X h(x) exp(⟨θ, φ(x)⟩) dx.


We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.3.)
With our notation, we have the following definition.

Definition 3.1. The exponential family associated with the function φ and base measure µ is
defined as the set of distributions with densities pθ with respect to µ, where

    pθ (x) = exp (⟨θ, φ(x)⟩ − A(θ)) ,                                      (3.1.1)

and the function A is the log-partition-function (or cumulant function)

    A(θ) := log ∫_X exp (⟨θ, φ(x)⟩) dµ(x)                                  (3.1.2)

whenever A is finite (and is +∞ otherwise). The family is regular if the domain

Θ := {θ | A(θ) < ∞}

is open.

In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation

    pθ (x) = h(x) exp(⟨θ, φ(x)⟩ − A(θ)).

In some scenarios, it may be convenient to re-parameterize the problem in terms of some function
η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are
most convenient.
We now give a few examples of exponential family models.

Example 3.1.1 (Bernoulli distribution): In this case, we have X ∈ {0, 1} and P (X = 1) = p
for some p ∈ [0, 1] in the classical version of a Bernoulli. Thus we take µ to be the counting
measure on {0, 1}, and by setting θ = log(p/(1 − p)) to obtain a canonical representation, we have

    P (X = x) = p(x) = p^x (1 − p)^{1−x} = exp(x log p + (1 − x) log(1 − p))
              = exp(x log(p/(1 − p)) + log(1 − p)) = exp(xθ − log(1 + e^θ)).

The Bernoulli family thus has log-partition function A(θ) = log(1 + e^θ). 3

Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually
parameterized by some λ > 0, and for x ∈ N has distribution Pλ (X = x) = (1/x!)λ^x e^{−λ}. Thus
by taking µ to be counting (discrete) measure on {0, 1, . . .} and setting θ = log λ, we find the
density (probability mass function in this case)

    p(x) = (1/x!) λ^x e^{−λ} = (1/x!) exp(x log λ − λ) = (1/x!) exp(xθ − e^θ).

Notably, taking h(x) = (x!)^{−1} and log-partition A(θ) = e^θ, we have probability mass function
pθ (x) = h(x) exp(θx − A(θ)). 3


Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal
distribution, we take µ to be Lebesgue measure on Rd . If we fix the covariance and vary only
the mean µ in the family N(µ, Σ), then X ∼ N(µ, Σ) has density

    pµ (x) = exp(−(1/2)(x − µ)ᵀΣ^{−1}(x − µ) − (1/2) log det(2πΣ)).

Reparameterizing by θ = Σ^{−1}µ and collecting the terms that do not depend on θ, we obtain

    pθ (x) = exp(−(1/2)xᵀΣ^{−1}x − (1/2) log det(2πΣ)) exp(xᵀθ − (1/2)θᵀΣθ),

where the first factor is the carrier. In particular, we have carrier h(x) =
exp(−(1/2)xᵀΣ^{−1}x)/((2π)^{d/2} det(Σ)^{1/2}), sufficient statistic φ(x) = x, and log partition
A(θ) = (1/2)θᵀΣθ. 3

Example 3.1.4 (Normal distribution): Let X ∼ N(µ, Σ). We may re-parameterize this by
Θ = Σ^{−1} and θ = Σ^{−1}µ, and we have density

    pθ,Θ (x) ∝ exp(⟨θ, x⟩ − (1/2)⟨xxᵀ, Θ⟩),

where ⟨·, ·⟩ denotes the Euclidean inner product. See Exercise 3.1. 3

In some cases, it is analytically convenient to include a few more conditions on the exponential
family.

Definition 3.2. Let {Pθ }θ∈Θ be an exponential family as in Definition 3.1. The sufficient statistic
φ is minimal if Θ = dom A ⊂ Rd is full-dimensional and there exists no vector u ≠ 0 such that
⟨u, φ(x)⟩ is constant µ-almost surely.

Definition 3.2 is essentially equivalent to stating that φ(x) = (φ1 (x), . . . , φd (x)) has linearly inde-
pendent components when viewed as vectors [φi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [39, Chapter 1]).

3.2 Why exponential families?


There are many reasons for us to study exponential families. The first major reason is their
analytical tractability: as the normal distribution does, they often admit relatively straightforward
computation, therefore forming a natural basis for modeling decisions. Their analytic tractability
has made them the objects of substantial study for nearly the past hundred years; Brown [39]
provides a deep and elegant treatment. Moreover, as we see later, they arise as the solutions to
several natural optimization problems on the space of probability distributions, and they also enjoy
certain robustness properties related to optimal Bayes’ procedures (there is, of course, more to
come on this topic).


Here, we enumerate a few of their key analytical properties, focusing on the cumulant generating
(or log partition) function A(θ) = log ∫ e^{⟨θ,φ(x)⟩} dµ(x). We begin with a heuristic calculation, where
we assume that we may exchange differentiation and integration. Assuming that this is the case, we
then obtain the important expectation and covariance relationships that

    ∇A(θ) = (1 / ∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ ∇θ e^{⟨θ,φ(x)⟩} dµ(x)
          = e^{−A(θ)} ∫ ∇θ e^{⟨θ,φ(x)⟩} dµ(x) = ∫ φ(x) e^{⟨θ,φ(x)⟩−A(θ)} dµ(x) = Eθ [φ(X)]

because e^{⟨θ,φ(x)⟩−A(θ)} = pθ (x). A completely similar (and still heuristic, at least at this point)
calculation gives

    ∇²A(θ) = Eθ [φ(X)φ(X)ᵀ] − Eθ [φ(X)]Eθ [φ(X)]ᵀ = Covθ (φ(X)).

That these identities hold is no accident and is central to the appeal of exponential family models.
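One can check these identities numerically in a simple case. The following Python sketch (an illustration only) uses the Bernoulli family of Example 3.1.1, where A(θ) = log(1 + e^θ), and compares finite differences of A against the mean and variance of φ(X) = X; the choices of θ and step size are arbitrary:

    import numpy as np

    A = lambda t: np.log1p(np.exp(t))   # Bernoulli log partition function
    theta, h = 0.7, 1e-5
    p = 1.0 / (1.0 + np.exp(-theta))    # E_theta[phi(X)] = P_theta(X = 1)

    dA = (A(theta + h) - A(theta - h)) / (2 * h)               # finite-difference A'
    d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h**2  # finite-difference A''
    print(dA, p)              # both approximate E_theta[phi(X)]
    print(d2A, p * (1 - p))   # both approximate Var_theta(phi(X))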
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that ∇2 A(θ) = Covθ (φ(X))  0 makes convexity of A immediate, one can also provide a
direct argument without appealing to differentiation.

Proposition 3.2.1. The cumulant-generating function θ ↦ A(θ) is convex, and it is strictly convex
if and only if Covθ (φ(X)) is positive definite for all θ ∈ dom A.

Proof   Let θλ = λθ1 + (1 − λ)θ2 , where θ1 , θ2 ∈ Θ. Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder’s
inequality implies

    log ∫ exp(⟨θλ , φ(x)⟩) dµ(x) = log ∫ exp(⟨θ1 , φ(x)⟩)^λ exp(⟨θ2 , φ(x)⟩)^{1−λ} dµ(x)
        ≤ log [(∫ exp(⟨θ1 , φ(x)⟩) dµ(x))^λ (∫ exp(⟨θ2 , φ(x)⟩) dµ(x))^{1−λ}]
        = λ log ∫ exp(⟨θ1 , φ(x)⟩) dµ(x) + (1 − λ) log ∫ exp(⟨θ2 , φ(x)⟩) dµ(x),

as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we
formally show that ∇2 A(θ) = Covθ (φ(X)).

We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of
the sufficient statistics φ(x). To describe the properties, we provide a bit of notation related to
tensor products: for a vector x ∈ Rd , we let

    x^{⊗k} := x ⊗ x ⊗ · · · ⊗ x   (k times)

denote the kth order tensor, or multilinear operator, that for v1 , . . . , vk ∈ Rd satisfies

    x^{⊗k}(v1 , . . . , vk ) := ⟨x, v1 ⟩ · · · ⟨x, vk ⟩ = ∏_{i=1}^k ⟨x, vi ⟩.


When k = 2, this is the familiar outer product x⊗2 = xx> . (More generally, one may think of x⊗k
as a d × d × · · · × d box, where the (i1 , . . . , ik ) entry is [x⊗k ]i1 ,...,ik = xi1 · · · xik .) With this notation,
our first key result regards the differentiability of A, where we can compute (all) derivatives of eA(θ)
by interchanging integration and differentiation.

Proposition 3.2.2. The cumulant-generating function θ ↦ A(θ) is infinitely differentiable on the
interior of its domain Θ := {θ ∈ Rd : A(θ) < ∞}. The moment-generating function

    M (θ) := ∫ exp(⟨θ, φ(x)⟩) dµ(x)

is analytic on the set ΘC := {z ∈ Cd | Re z ∈ Θ}. Additionally, the derivatives of M are computed
by passing through the integral, that is,

    ∇θ^k M (θ) = ∇θ^k ∫ e^{⟨θ,φ(x)⟩} dµ(x) = ∫ ∇θ^k e^{⟨θ,φ(x)⟩} dµ(x) = ∫ φ(x)^{⊗k} exp(⟨θ, φ(x)⟩) dµ(x).

The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation
and covariance relationships that
    ∇A(θ) = (1 / ∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ ∇e^{⟨θ,φ(x)⟩} dµ(x) = ∫ φ(x) pθ (x) dµ(x) = Eθ [φ(X)]

and

    ∇²A(θ) = (1 / ∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ φ(x)^{⊗2} e^{⟨θ,φ(x)⟩} dµ(x) − (∫ φ(x) e^{⟨θ,φ(x)⟩} dµ(x))^{⊗2} / (∫ e^{⟨θ,φ(x)⟩} dµ(x))²
           = Eθ [φ(X)φ(X)ᵀ] − Eθ [φ(X)]Eθ [φ(X)]ᵀ
           = Covθ (φ(X)).

Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties.
Recall that A is strictly convex if

A(λθ0 + (1 − λ)θ1 ) < λA(θ0 ) + (1 − λ)A(θ1 )

whenever λ ∈ (0, 1) and θ0 , θ1 ∈ dom A. We have the following proposition.

Proposition 3.2.3. Let {Pθ } be a regular exponential family. The log partition function A is
strictly convex if and only if {Pθ } is minimal.

Proof   If the family is minimal, then Varθ (uᵀφ(X)) > 0 for any vector u ≠ 0, while Varθ (uᵀφ(X)) =
uᵀ∇²A(θ)u. This implies the strict positive definiteness ∇²A(θ) ≻ 0, which is equivalent to strict
convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if ∇²A(θ) ≻ 0 for all θ ∈ Θ, then
Varθ (uᵀφ(X)) > 0 for all u ≠ 0 and so uᵀφ(x) is non-constant in x.


3.2.1 Fitting an exponential family model


The convexity and differentiability properties make exponential family models especially attractive
from a computational perspective. A major focus in statistics is the convergence of estimates of
different properties of a population distribution P and whether these estimates are computable.
We will develop tools to address the first of these questions, and attendant optimality guarantees,
throughout this book. To set the stage for what follows, let us consider what this entails in the
context of exponential family models.
Suppose we have a population P (where, for simplicity, we assume P has a density p), and for
a given exponential family P with densities {pθ }, we wish to find the model closest to P . Then it
is natural (if we take on faith that the information-theoretic measures we have developed are the
“right” ones) find the distribution Pθ ∈ P closest to P in KL-divergence, that is, to solve
Z
p(x)
minimize Dkl (P ||Pθ ) = p(x) log dx. (3.2.1)
θ pθ (x)

This is evidently equivalent to minimizing


Z Z
− p(x) log pθ (x)dx = p(x) [−hθ, φ(x)i + A(θ)] dx = −hθ, EP [φ(X)]i + A(θ).

This is always a convex optimization problem (see Appendices B and C for much more on this), as A
is convex and the first term is linear, and so has no non-global optima. Here and throughout, as we
mention in the introductory remarks to this chapter, we treat convex optimization as a technology:
as long as the dimension of a problem is not too large and its objective can be evaluated, it is
(essentially) computationally trivial.
Of course, we never have access to the population P fully; instead, we receive a sample
X1 , . . . , Xn from P . In this case, a natural approach is to replace the expected (negative) log
likelihood above with its empirical version and solve
    minimize_θ  −∑_{i=1}^n log pθ (Xi ) = ∑_{i=1}^n [−⟨θ, φ(Xi )⟩ + A(θ)],      (3.2.2)

which is still a convex optimization problem (as the objective is convex in θ). The maximum
likelihood estimate is any vector θ̂n minimizing the negative log likelihood (3.2.2), which by setting
gradients to 0 is evidently any vector satisfying

    ∇A(θ̂n ) = E_{θ̂n}[φ(X)] = (1/n) ∑_{i=1}^n φ(Xi ).                          (3.2.3)

In particular, we need only find a parameter θ̂n matching moments of the empirical distribution
of the observed Xi ∼ P . This θ̂n is unique whenever Covθ (φ(X)) ≻ 0 for all θ, that is, when
the covariance of φ is full rank in the exponential family model, because then the objective in the
minimization problem (3.2.2) is strictly convex.
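To make the moment-matching condition (3.2.3) concrete, here is a minimal Python sketch that fits the Poisson family of Example 3.1.2 (where A(θ) = e^θ and φ(x) = x) by Newton's method on the convex negative log likelihood; the data and iteration count are illustrative, and the closed form θ̂n = log of the sample mean lets us verify the output:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.poisson(lam=3.0, size=1000)
    mean_phi = x.mean()                  # (1/n) sum_i phi(X_i)

    theta = 0.0
    for _ in range(25):
        grad = np.exp(theta) - mean_phi  # gradient: grad A(theta) - mean of phi
        hess = np.exp(theta)             # Hessian: grad^2 A(theta) = Cov_theta(phi(X))
        theta -= grad / hess             # Newton step on the convex objective

    print(theta, np.log(mean_phi))       # both equal the MLE, log(sample mean)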
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the
estimator θ̂n ; the next paragraph assumes comfort with some classical asymptotic statistics
(and the central limit theorem) and is not essential for what comes later. Then we can see how
minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data


Xi are i.i.d. from an exponential family model Pθ⋆ . Then we expect that the maximum likelihood
estimate θ̂n should converge to θ⋆ , and so

    (1/n) ∑_{i=1}^n φ(Xi ) = ∇A(θ̂n ) = ∇A(θ⋆ ) + (∇²A(θ⋆ ) + o(1))(θ̂n − θ⋆ ).

But of course, ∇A(θ⋆ ) = Eθ⋆ [φ(X)], and so the central limit theorem gives that

    (1/n) ∑_{i=1}^n (φ(Xi ) − ∇A(θ⋆ )) ∼· N(0, n^{−1} Covθ⋆ (φ(X))) = N(0, n^{−1} ∇²A(θ⋆ )),

where ∼· means “is approximately distributed as.” Multiplying by (∇²A(θ⋆ ) + o(1))^{−1} ≈ ∇²A(θ⋆ )^{−1},
we thus see (still working in our heuristic)

    θ̂n − θ⋆ = (∇²A(θ⋆ ) + o(1))^{−1} (1/n) ∑_{i=1}^n (φ(Xi ) − ∇A(θ⋆ ))
             ∼· N(0, n^{−1} · ∇²A(θ⋆ )^{−1}),                                    (3.2.4)

where we use that BZ ∼ N(0, BΣBᵀ) if Z ∼ N(0, Σ). (It is possible to make each of these steps
fully rigorous.) Thus the cumulant generating function A governs the error we expect in θ̂n − θ⋆ .
Much of the rest of this book explores properties of these types of minimization problems: at
what rates do we expect θ̂n to converge to a global minimizer of problem (3.2.1)? Can we show
that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential
families form a particular working example to motivate this development.

3.3 Divergence measures and information for exponential families


Their nice analytic properties mean that exponential family models also play nicely with the in-
formation theoretic tools we develop. Indeed, consider the KL-divergence between two exponential
family distributions Pθ and Pθ+∆ , where ∆ ∈ Rd . Then we have

    Dkl (Pθ ||Pθ+∆ ) = Eθ [⟨θ, φ(X)⟩ − A(θ) − ⟨θ + ∆, φ(X)⟩ + A(θ + ∆)]
                     = A(θ + ∆) − A(θ) − Eθ [⟨∆, φ(X)⟩]
                     = A(θ + ∆) − A(θ) − ∇A(θ)ᵀ∆.

Similarly, we have

    Dkl (Pθ+∆ ||Pθ ) = Eθ+∆ [⟨θ + ∆, φ(X)⟩ − A(θ + ∆) − ⟨θ, φ(X)⟩ + A(θ)]
                     = A(θ) − A(θ + ∆) + Eθ+∆ [⟨∆, φ(X)⟩]
                     = A(θ) − A(θ + ∆) − ∇A(θ + ∆)ᵀ(−∆).

These identities give an immediate connection with convexity. Indeed, for a differentiable convex
function h, the first-order divergence associated with h is

    Dh (u, v) = h(u) − h(v) − ⟨∇h(v), u − v⟩,                              (3.3.1)

which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. In much of the statistical and machine learning literature, the


divergence (3.3.1) is called a Bregman divergence, though we will use the more evocative first-
order divergence. These will appear frequently throughout the book and, more generally, appear
frequently in work on optimization and statistics.
JCD Comment: Put in a picture of a Bregman divergence

We catalog these results as the following proposition.


Proposition 3.3.1. Let {Pθ } be an exponential family model with cumulant generating function
A(θ). Then

Dkl (Pθ ||Pθ+∆ ) = DA (θ + ∆, θ) and Dkl (Pθ+∆ ||Pθ ) = DA (θ, θ + ∆).

Additionally, there exists a t ∈ [0, 1] such that

    Dkl (Pθ ||Pθ+∆ ) = (1/2) ∆ᵀ∇²A(θ + t∆)∆,

and similarly, there exists a t ∈ [0, 1] such that

    Dkl (Pθ+∆ ||Pθ ) = (1/2) ∆ᵀ∇²A(θ + t∆)∆.
Proof We have already shown the first two statements; the second two are applications of Tay-
lor’s theorem.

When the perturbation ∆ is small, that A is infinitely differentiable then gives that
    Dkl (Pθ ||Pθ+∆ ) = (1/2) ∆ᵀ∇²A(θ)∆ + O(‖∆‖³),
so that the Hessian ∇2 A(θ) tells quite precisely how the KL divergence changes as θ varies (locally).
As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between
two distributions is small, it is hard to test between them, and in the sequel, we will show converses
to this. The Hessian ∇2 A(θ? ) also governs the error in the estimate θbn − θ? in our heuristic (3.2.4).
When the Hessian ∇²A(θ) is large (very positive definite), the KL divergence Dkl (Pθ ||Pθ+∆ ) is large,
and the asymptotic covariance (3.2.4) is small. For this—and other reasons we address later—for
exponential family models, we call

    ∇²A(θ) = Covθ (φ(X)) = Eθ [∇ log pθ (X)∇ log pθ (X)ᵀ]                  (3.3.2)

the Fisher information of the parameter θ in the model {Pθ }.
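The identity Dkl (Pθ ||Pθ+∆ ) = DA (θ + ∆, θ) of Proposition 3.3.1 is easy to check numerically; the following Python sketch (with arbitrary choices of θ and ∆) does so for the Bernoulli family of Example 3.1.1:

    import numpy as np

    A = lambda t: np.log1p(np.exp(t))        # Bernoulli log partition
    dA = lambda t: 1.0 / (1.0 + np.exp(-t))  # A'(theta) = mean under P_theta

    theta, delta = 0.2, 0.9
    p, q = dA(theta), dA(theta + delta)      # means under P_theta and P_{theta+Delta}

    kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    bregman = A(theta + delta) - A(theta) - dA(theta) * delta
    print(kl, bregman)                       # the two values coincide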

3.4 Generalized linear models and regression


We can specialize the general modeling strategies that exponential families provide to more directly
address prediction problems, where we wish to predict a target Y ∈ Y given covariates X ∈ X .
Here, we almost always have that Y is either discrete or continuous with Y ⊂ R. In this case, we
have a sufficient statistic φ : X × Y → Rd , and we model Y | X = x via the generalized linear model
(or conditional exponential family model) if it has density or probability mass function
 
    pθ (y | x) = exp(φ(x, y)ᵀθ − A(θ | x)) h(y),                           (3.4.1)


where as before h is the carrier and (in the case that Y ⊂ Rk )

    A(θ | x) = log ∫ exp(φ(x, y)ᵀθ) h(y) dy

or, in the discrete case,

    A(θ | x) = log ∑_y exp(φ(x, y)ᵀθ) h(y).

The log partition function A(· | x) provides the same insights for the conditional models (3.4.1)
as it does for the unconditional exponential family models in the preceding sections. Indeed, as
in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C ∞ on its domain and convex.
Moreover, it gives the expected moments of the sufficient statistic φ conditional on x, as

∇A(θ | x) = Eθ [φ(X, Y ) | X = x],

from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear
regression, binary logistic regression, and multiclass logistic regression. We give these three, as
well as describing two more important examples involving modeling count data through Poisson
regression and making predictions for targets y known to live in a bounded set.

Example 3.4.1 (Linear regression): In linear regression, we wish to predict Y ∈ R from a
vector X ∈ Rd , and assume that Y | X = x follows the normal distribution N(θᵀx, σ²). In this
case, we have

    pθ (y | x) = (1/√(2πσ²)) exp(−(1/(2σ²))(y − xᵀθ)²)
               = exp((1/σ²) y xᵀθ − (1/(2σ²)) θᵀxxᵀθ) exp(−(1/(2σ²)) y² − (1/2) log(2πσ²)),

so that we have the exponential family representation (3.4.1) with φ(x, y) = (1/σ²) xy, h(y) =
exp(−(1/(2σ²)) y² − (1/2) log(2πσ²)), and A(θ | x) = (1/(2σ²)) θᵀxxᵀθ. As ∇A(θ | x) = Eθ [φ(X, Y ) | X = x] =
(1/σ²) x Eθ [Y | X = x], we easily recover Eθ [Y | X = x] = θᵀx. 3

Frequently, we wish to predict binary or multiclass random variables Y . For example, consider
a medical application in which we wish to assess the probability that, based on a set of covariates
x ∈ Rd (say, blood pressure, height, weight, family history), an individual will have a heart attack
in the next 5 years, so that Y = 1 indicates heart attack and Y = −1 indicates not. The next
example shows how we might model this.
Example 3.4.2 (Binary logistic regression): If Y ∈ {−1, 1}, we model

    pθ (y | x) = exp(yxᵀθ) / (1 + exp(yxᵀθ)),

where the idea in the probability above is that if xᵀθ has the same sign as y, then the larger
yxᵀθ is, the higher the probability assigned to the label y; when yxᵀθ < 0, the probability
is small. Of course, we always have pθ (y | x) + pθ (−y | x) = 1, and using the identity

    yxᵀθ − log(1 + exp(yxᵀθ)) = ((y + 1)/2) xᵀθ − log(1 + exp(xᵀθ)),

we obtain the generalized linear model representation φ(x, y) = ((y + 1)/2) x and A(θ | x) = log(1 +
exp(xᵀθ)).
As an alternative, we could represent Y ∈ {0, 1} by

    pθ (y | x) = exp(yxᵀθ) / (1 + exp(xᵀθ)) = exp(yxᵀθ − log(1 + e^{xᵀθ})),

which has the simpler sufficient statistic φ(x, y) = xy. 3
Instead of a binary prediction problem, in many cases we have a multiclass prediction problem,
where we seek to predict a label Y for an object x belonging to one of k different classes. For
example, in image recognition, we are given an image x and wish to identify the subject Y of the
image, where Y ranges over k classes, such as birds, dogs, cars, trucks, and so on. This too we can
model using exponential families.
Example 3.4.3 (Multiclass logistic regression): In the case that we have a k-class prediction
problem in which we wish to predict Y ∈ {1, . . . , k} from X ∈ Rd , we assign parameters
θy ∈ Rd to each of the classes y = 1, . . . , k. We then model

    pθ (y | x) = exp(θyᵀx) / ∑_{j=1}^k exp(θjᵀx) = exp(θyᵀx − log ∑_{j=1}^k e^{θjᵀx}).

Here, the idea is that if θyᵀx > θjᵀx for all j ≠ y, then the model assigns higher probability to
class y than any other class; the larger the gap between θyᵀx and θjᵀx, the larger the difference
in assigned probabilities. 3
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When Y ∈ N is a count, the Poisson distribution with
rate λ > 0 gives P (Y = y) = e^{−λ}λ^y /y!. Poisson regression models λ via e^{θᵀx}, giving the model

    pθ (y | x) = (1/y!) exp(yxᵀθ − e^{θᵀx}),

so that we have carrier h(y) = 1/y! and the simple sufficient statistic φ(x, y) = xy. The log partition
function is A(θ | x) = e^{θᵀx}. 3
Lastly, we consider a less standard example, but which highlights the flexibility of these models.
Here, we assume a linear regression problem but in which we wish to predict values Y in a bounded
range.
Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish
to model it via an exponential family model with density

    pθ (y | x) = exp(yxᵀθ − A(θ | x)) 1 {y ∈ [−b, b]} ,

which is non-zero only for −b ≤ y ≤ b. Letting s = xᵀθ for shorthand, we have

    ∫_{−b}^{b} e^{ys} dy = (1/s)[e^{bs} − e^{−bs}],

where the limit as s → 0 is 2b; the (conditional) log partition function is thus

    A(θ | x) = log((e^{bθᵀx} − e^{−bθᵀx})/(θᵀx))   if θᵀx ≠ 0,
               log(2b)                             otherwise.

While its functional form makes this highly non-obvious, our general results guarantee that
A(θ | x) is indeed C∞ and convex in θ. We have ∇A(θ | x) = xEθ [Y | X = x] because
φ(x, y) = xy, and we can therefore immediately recover Eθ [Y | X = x]. Indeed, set s = θᵀx,
and without loss of generality assume s ≠ 0. Then

    E[Y | xᵀθ = s] = (∂/∂s) log((e^{bs} − e^{−bs})/s) = b(e^{bs} + e^{−bs})/(e^{bs} − e^{−bs}) − 1/s,

which increases from −b to b as s = xᵀθ increases from −∞ to +∞. 3
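Because the formula for E[Y | xᵀθ = s] above is non-obvious, the following Python sketch (with arbitrary choices of b and s) compares it against a direct numerical integration of the tilted density on [−b, b]:

    import numpy as np

    b, s = 2.0, 0.8
    ys = np.linspace(-b, b, 400_001)
    w = np.exp(s * ys)                       # unnormalized tilted density on [-b, b]
    mean_numeric = (ys * w).sum() / w.sum()  # Riemann approximation of E[Y | s]
    mean_formula = (b * (np.exp(b * s) + np.exp(-b * s))
                    / (np.exp(b * s) - np.exp(-b * s)) - 1 / s)
    print(mean_numeric, mean_formula)        # the two agree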

3.4.1 Fitting a generalized linear model from a sample


We briefly revisit the approach in Section 3.2.1 for fitting exponential family models in the context
of generalized linear models. In this case, the analogue of the maximum likelihood problem (3.2.2)
is to solve
    minimize_θ  −∑_{i=1}^n log pθ (Yi | Xi ) = ∑_{i=1}^n [−φ(Xi , Yi )ᵀθ + A(θ | Xi )].

This is a convex optimization problem with C∞ objective, so we can treat solving it as an (essen-
tially) trivial problem unless the sample size n or dimension d of θ are astronomically large.
As in the moment matching equality (3.2.3), a necessary and sufficient condition for θ̂n to
minimize the above objective is that it achieves 0 gradient, that is,

    (1/n) ∑_{i=1}^n ∇A(θ̂n | Xi ) = (1/n) ∑_{i=1}^n φ(Xi , Yi ).

Once again, to find θ̂n amounts to matching moments, as ∇A(θ | Xi ) = E[φ(X, Y ) | X = Xi ], and
we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless,
we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In
particular, for the logarithmic loss ℓ(θ; x, y) = − log pθ (y | x), we can define the empirical loss

    Ln (θ) := (1/n) ∑_{i=1}^n ℓ(θ; Xi , Yi ).

Then, as n → ∞, we expect that Ln (θ) → E[ℓ(θ; X, Y )], so that the minimizing θ should give the
best predictions possible according to the loss ℓ. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like Ln ) from their population
counterparts.
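As an illustration of these ideas, the following Python sketch (with synthetic data, an arbitrary step size, and plain gradient descent—one of many reasonable solvers) fits the {0, 1} logistic model of Example 3.4.2 by minimizing Ln and then checks the moment-matching optimality condition:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    theta_star = np.array([1.0, -2.0, 0.5])  # synthetic "true" parameter
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_star)))

    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    theta = np.zeros(d)
    for _ in range(5000):
        # gradient of (1/n) sum_i [A(theta | x_i) - y_i x_i^T theta]
        grad = X.T @ (sigmoid(X @ theta) - y) / n
        theta -= 0.5 * grad

    # moment matching at the optimum:
    # (1/n) sum_i x_i E_theta[Y | x_i] matches (1/n) sum_i x_i y_i
    print(np.abs(X.T @ (sigmoid(X @ theta) - y) / n).max())  # approximately 0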


3.5 Lower bounds on testing a parameter’s value


We give a bit of a preview here of the tools we will develop to prove fundamental limits in Part II of
the book, an hors d’oeuvre that points to the techniques we develop. In Section 2.3.1, we presented
Le Cam’s method and used it in Example 2.3.2 to give a lower bound on the probability of error in
a hypothesis test comparing two normal means. This approach extends beyond this simple case,
and here we give another example applying it to exponential family models.
We give a stylized version of the problem. Let {Pθ } be an exponential family model with
parameter θ ∈ Rd . Suppose for some vector v ∈ Rd , we wish to test whether vᵀθ > 0 or vᵀθ < 0 in
the model. For example, in the regression settings in Section 3.4, we may be interested in the effect
of a treatment on health outcomes. Then the covariates x contain information about an individual,
with first index x1 corresponding to whether the individual is treated or not, while Y measures the
outcome of treatment; setting v = e1 , we then wish to test whether there is a positive treatment
effect (θ1 = e1ᵀθ > 0) or a negative one.
Abstracting away the specifics of the scenario, we ask the following question: given an exponen-
tial family {Pθ } and a threshold t of interest, at what separation δ > 0 does it become essentially
impossible to test
    vᵀθ ≤ t versus vᵀθ ≥ t + δ?
We give one approach to this using two-point hypothesis testing lower bounds. In this case, we
consider testing sequences of two alternatives

H0 : θ = θ0 versus H1,n : θ = θn

as n grows, where we observe a sample X1n drawn i.i.d. either according to Pθ0 (i.e., H0 ) or Pθn
(i.e., H1,n ). By choosing θn in a way that makes the separation v > (θn − θ0 ) large but testing H0
against H1,n challenging, we can then (roughly) identify the separation δ at which testing becomes
impossible.

Proposition 3.5.1. Let θ0 ∈ Rd . Then there exists a sequence of parameters θn with ‖θn − θ0‖ =
O(1/√n), separation

    vᵀ(θn − θ0 ) = (1/√n) √(vᵀ∇²A(θ0 )^{−1}v),

and for which

    inf_Ψ {Pθ0 (Ψ(X1^n ) ≠ 0) + Pθn (Ψ(X1^n ) ≠ 1)} ≥ 1/2 + O(n^{−1/2}).
Proof   Let ∆ ∈ Rd be a potential perturbation to θ1 = θ0 + ∆, which gives separation δ =
vᵀθ1 − vᵀθ0 = vᵀ∆. Let P0 = Pθ0 and P1 = Pθ1 . Then the smallest summed probability of error
in testing between P0 and P1 based on n observations X1^n is

    inf_Ψ {P0 (Ψ(X1 , . . . , Xn ) ≠ 0) + P1 (Ψ(X1 , . . . , Xn ) ≠ 1)} = 1 − ‖P0^n − P1^n‖TV

by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker’s inequal-
ity (2.2.10) and use that the KL-divergence tensorizes to find

    2 ‖P0^n − P1^n‖²TV ≤ n Dkl (P0 ||P1 ) = n Dkl (Pθ0 ||Pθ0+∆ ) = n DA (θ0 + ∆, θ0 ),

where the final equality follows from the equivalence between KL and first-order divergences for
exponential families (Proposition 3.3.1).
To guarantee that the summed probability of error is at least 1/2, that is, ‖P0^n − P1^n‖TV ≤ 1/2,
it suffices to choose ∆ satisfying n DA (θ0 + ∆, θ0 ) ≤ 1/2. So to maximize the separation vᵀ∆ while
guaranteeing a constant probability of error, we (approximately) solve

    maximize vᵀ∆   subject to DA (θ0 + ∆, θ0 ) ≤ 1/(2n).

Now, consider that DA (θ0 + ∆, θ0 ) = (1/2)∆ᵀ∇²A(θ0 )∆ + O(‖∆‖³). Ignoring the higher order term,
we consider maximizing vᵀ∆ subject to ∆ᵀ∇²A(θ0 )∆ ≤ 1/n. A Lagrangian calculation shows that
this has solution

    ∆ = (1/√n) · ∇²A(θ0 )^{−1}v / √(vᵀ∇²A(θ0 )^{−1}v).

With this choice, we have separation δ = vᵀ∆ = √(vᵀ∇²A(θ0 )^{−1}v / n), and DA (θ0 + ∆, θ0 ) =
1/(2n) + O(1/n^{3/2}). The summed probability of error is at least

    1 − ‖P0^n − P1^n‖TV ≥ 1 − √(n/(4n)) + O(n^{−1/2}) = 1 − √(1/4) + O(n^{−1/2}) = 1/2 + O(n^{−1/2}),

as desired.

Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Sec-
tion 3.2.1. For an unknown parameter θ in the exponential family model Pθ , we observe X1 , . . . , Xn ,
and wish to test whether vᵀθ ≥ t for a given threshold t. Call our null H0 : vᵀθ ≤ t, and assume
we wish to test at an asymptotic level α > 0, meaning that the probability the test falsely rejects
H0 is (as n → ∞) at most α. Assuming the heuristic (3.2.4), we have the approximate distributional
equality

    vᵀθ̂n ∼· N(vᵀθ, (1/n) vᵀ∇²A(θ̂n )^{−1}v).

Note that we have θ̂n on the right side of the distribution; it is possible to make this rigorous, but
here we target only intuition building. A natural asymptotically level-α test is then

    Tn := Reject  if vᵀθ̂n ≥ t + z_{1−α} √(vᵀ∇²A(θ̂n )^{−1}v / n),  and Accept otherwise,

where z_{1−α} is the 1 − α quantile of a standard normal, P(Z ≥ z_{1−α}) = α for Z ∼ N(0, 1). Let θ0
be such that vᵀθ0 = t, so H0 holds. Then

    Pθ0 (Tn rejects) = Pθ0 (√n · vᵀ(θ̂n − θ0 ) ≥ z_{1−α} √(vᵀ∇²A(θ̂n )^{−1}v)) → α.

At least heuristically, then, the separation δ = √(vᵀ∇²A(θ0 )^{−1}v / n) is the fundamental separation
in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence Dkl (Pθ0+∆ ||Pθ0 ) =
(1/2)∆ᵀ∇²A(θ0 )∆ + O(‖∆‖³) near θ0 plays the fundamental role in both the lower bound and upper
bound on testing. When the Hessian ∇²A(θ0 ) is “large,” meaning it is very positive definite,
bound on testing. When the Hessian ∇2 A(θ0 ) is “large,” meaning it is very positive definite,
distributions with small parameter distances are still well-separated in KL-divergence, making
testing easy, while when ∇2 A(θ0 ) is small (nearly indefinite), the KL-divergence can be small even
for large parameter separations ∆ and testing is hard. As a consequence, at least for exponential
family models, the Fisher information (3.3.2), which we defined as ∇2 A(θ) = Covθ (φ(X)), plays a
central role in testing and, as we see later, estimation.


3.6 Deferred proofs


We collect proofs that rely on background we do not assume for this book here.

3.6.1 Proof of Proposition 3.2.2


We follow Brown [39]. We demonstrate only the first-order differentiability using Lebesgue’s domi-
nated convergence theorem, as higher orders and the interchange of integration and differentiation
are essentially identical. Demonstrating first-order complex differentiability is of course enough to
show that A is analytic.1 As the proof of Proposition 3.2.1 does not rely on analyticity of A, we
may use its results. Thus, let Θ = dom A(·) in Rd , which is convex. We assume Θ has non-empty
interior (if the interior is empty, then the convexity of Θ means that it must lie in a lower dimen-
sional subspace; we simply take the interior relative to that subspace and may proceed). We claim
the following lemma, which is the key to applying dominated convergence; we state it first for Rd .

Lemma 3.6.1. Consider any collection {θ1 , . . . , θm } ⊂ Θ, and let Θ0 = Conv{θi }_{i=1}^m and C ⊂
int Θ0 . Then for any k ∈ N, there exists a constant K = K(C, k, {θi }) such that for all θ0 ∈ C,

    ‖x‖^k exp(⟨θ0 , x⟩) ≤ K max_{j≤m} exp(⟨θj , x⟩).

Proof   Let B = {u ∈ Rd | ‖u‖ ≤ 1} be the unit ball in Rd . For any ε > 0, there exists a K = K(ε)
such that ‖x‖^k ≤ Ke^{ε‖x‖} for all x ∈ Rd . As C ⊂ int Θ0 , there exists an ε > 0 such that for
all θ0 ∈ C, θ0 + 2εB ⊂ Θ0 , and by construction, for any u ∈ B we can write θ0 + 2εu = ∑_{j=1}^m λj θj
for some λ ∈ R^m_+ with 1ᵀλ = 1. We therefore have

    ‖x‖^k exp(⟨θ0 , x⟩) ≤ ‖x‖^k sup_{u∈B} exp(⟨θ0 + εu, x⟩)
        = ‖x‖^k exp(ε‖x‖) exp(⟨θ0 , x⟩) ≤ K exp(2ε‖x‖) exp(⟨θ0 , x⟩)
        = K sup_{u∈B} exp(⟨θ0 + 2εu, x⟩).

But using the convexity of t ↦ exp(t) and that θ0 + 2εu ∈ Θ0 , the last quantity has upper bound

    K sup_{u∈B} exp(⟨θ0 + 2εu, x⟩) ≤ K max_{j≤m} exp(⟨θj , x⟩).

This gives the desired claim.

A similar result is possible with differences of exponentials:

Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C,

    |e^{⟨θ,x⟩} − e^{⟨θ0,x⟩}| / ‖θ − θ0‖ ≤ K max_{j≤m} e^{⟨θj,x⟩}.

Proof   We write

    (exp(⟨θ, x⟩) − exp(⟨θ0 , x⟩)) / ‖θ − θ0‖ = exp(⟨θ0 , x⟩) · (exp(⟨θ − θ0 , x⟩) − 1) / ‖θ − θ0‖,

so that the lemma is equivalent to showing that

    |e^{⟨θ−θ0,x⟩} − 1| / ‖θ − θ0‖ ≤ K max_{j≤m} exp(⟨θj − θ0 , x⟩).

From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that
by convexity e^{−a} ≥ 1 − a for all a ∈ R, so 1 − e^a ≤ |a| when a ≤ 0. Conversely, if a > 0, then
ae^a ≥ e^a − 1 (note that (d/da)(ae^a ) = ae^a + e^a ≥ e^a ), so dividing by ‖x‖, we see that

    |e^{⟨θ,x⟩} − 1| / (‖θ‖ ‖x‖) ≤ |e^{⟨θ,x⟩} − 1| / |⟨θ, x⟩| ≤ max{⟨θ, x⟩e^{⟨θ,x⟩}, |⟨θ, x⟩|} / |⟨θ, x⟩| ≤ e^{⟨θ,x⟩} + 1.

As θ ∈ C, Lemma 3.6.1 then implies that

    |e^{⟨θ,x⟩} − 1| / ‖θ‖ ≤ ‖x‖ (e^{⟨θ,x⟩} + 1) ≤ K max_{j≤m} e^{⟨θj,x⟩},

as desired.

[Footnote 1: For complex functions, Osgood’s lemma shows that if A is continuous and holomorphic in each
variable individually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g., [92, Ch. 1].]

With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed,
fix θ0 ∈ int Θ and for θ ∈ Θ, define

    g(θ, x) = (exp(⟨θ, x⟩) − exp(⟨θ0 , x⟩) − exp(⟨θ0 , x⟩)⟨x, θ − θ0 ⟩) / ‖θ − θ0‖
            = (e^{⟨θ,x⟩} − e^{⟨θ0,x⟩} − ⟨∇e^{⟨θ0,x⟩}, θ − θ0 ⟩) / ‖θ − θ0‖.

Then lim_{θ→θ0} g(θ, x) = 0 by the differentiability of t ↦ e^t . Lemmas 3.6.1 and 3.6.2 show that if
we take any collection {θj }_{j=1}^m ⊂ Θ for which θ0 ∈ int Conv{θj }, then for C ⊂ int Conv{θj } there
exists a constant K such that

    |g(θ, x)| ≤ |exp(⟨θ, x⟩) − exp(⟨θ0 , x⟩)| / ‖θ − θ0‖ + ‖x‖ exp(⟨θ0 , x⟩) ≤ K max_{j≤m} exp(⟨θj , x⟩)

for all θ ∈ C. As ∫ max_j e^{⟨θj,x⟩} dµ(x) ≤ ∑_{j=1}^m ∫ e^{⟨θj,x⟩} dµ(x) < ∞, the dominated convergence
theorem thus implies that

    lim_{θ→θ0} ∫ g(θ, x) dµ(x) = 0,

and so M (θ) = exp(A(θ)) is differentiable in θ, as

    M (θ) = M (θ0 ) + ⟨∫ xe^{⟨θ0,x⟩} dµ(x), θ − θ0 ⟩ + o(‖θ − θ0‖).

It is evident that we have the derivative

    ∇M (θ) = ∫ ∇θ exp(⟨θ, x⟩) dµ(x) = ∫ x exp(⟨θ, x⟩) dµ(x).


Analyticity   Over the subset ΘC := {θ + iz | θ ∈ Θ, z ∈ Rd } (where i = √−1 is the imaginary
unit), we can extend the preceding results to demonstrate that A is analytic on ΘC . Indeed, we
first simply note that for a, b ∈ R, exp(a + ib) = exp(a) exp(ib) and |exp(a + ib)| = exp(a), i.e.
|e^z | = e^{Re z} for z ∈ C, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case.
These are enough for the application of the dominated convergence theorem above, and we use that
exp(·) is analytic to conclude that θ ↦ M (θ) is analytic on ΘC .


3.7 Bibliography

3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic φ and an explicit formula for the log
partition function A(θ, Θ) so that we can write pθ,Θ (x) = exp(hθ, φ1 (x)i + hΘ, φ2 (x)i − A(θ, Θ)).
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let ℓ(θ; x, y) =
− log pθ (y | x) be the associated log loss.

(i) Give the Hessian ∇²θ ℓ(θ; x, y).

(ii) Let (xi , yi )ni=1 ⊂ Rd × {±1} be a sample. Give a sufficient condition for the minimizer of the
empirical log loss
    Ln (θ) := (1/n) ∑_{i=1}^n ℓ(θ; xi , yi )

to be unique that depends only on the vectors {xi }. Hint. A convex function h is strictly
convex if and only if its Hessian ∇2 h is positive definite.

Part I

Concentration, information, stability, and generalization

Chapter 4

Concentration Inequalities

In many scenarios, it is useful to understand how a random variable X behaves by giving bounds
on the probability that it deviates far from its mean or median. This can allow us to prove
that estimation and learning procedures achieve certain performance guarantees, that different decoding and
encoding schemes work with high probability, and other results. In this chapter, we give several
tools for proving bounds on the probability that random variables are far from their typical values.
We conclude the section with a discussion of basic uniform laws of large numbers and applications
to empirical risk minimization and statistical learning, though we focus on the relatively simple
cases we can treat with our tools.

4.1 Basic tail inequalities


In this first section, we have a simple-to-state goal: given a random variable X, how does X
concentrate around its mean? That is, assuming w.l.o.g. that E[X] = 0, how well can we bound

P(X ≥ t)?

We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev,
and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.

Proposition 4.1.1 (Markov’s inequality). Let X be a nonnegative random variable, meaning that
X ≥ 0 with probability 1. Then
    P(X ≥ t) ≤ E[X]/t.
Proof For any random variable, P(X ≥ t) = E[1 {X ≥ t}] ≤ E[(X/t)1 {X ≥ t}] ≤ E[X]/t, as
X/t ≥ 1 whenever X ≥ t.

When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.


Proposition 4.1.2 (Chebyshev’s inequality). Let X be a random variable with Var(X) < ∞. Then

    P(X − E[X] ≥ t) ≤ Var(X)/t²   and   P(X − E[X] ≤ −t) ≤ Var(X)/t²
for all t ≥ 0.

Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).

If a random variable has a moment generating function—exponential moments—we can give


bounds that enjoy very nice properties when combined with sums of random variables. First, we
recall that
ϕX (λ) := E[eλX ]
is the moment generating function of the random variable X. Then we have the Chernoff bound.

Proposition 4.1.3. For any random variable X, we have

    P(X ≥ t) ≤ E[e^{λX}]/e^{λt} = ϕX (λ) e^{−λt}
for all λ ≥ 0.

Proof This is another application of Markov’s inequality: for λ > 0, we have eλX ≥ eλt if and
only if X ≥ t, so that P(X ≥ t) = P(eλX ≥ eλt ) ≤ E[eλX ]/eλt .

In particular, taking the infimum over all λ ≥ 0 in Proposition 4.1.3 gives the more standard
Chernoff (large deviation) bound
    P(X ≥ t) ≤ exp(inf_{λ≥0} {log ϕX (λ) − λt}).

Example 4.1.4 (Gaussian random variables): When X is a mean-zero Gaussian variable
with variance σ², we have

    ϕX (λ) = E[exp(λX)] = exp(λ²σ²/2).                                      (4.1.1)

To see this, we compute the integral; we have

    E[exp(λX)] = ∫_{−∞}^{∞} (1/√(2πσ²)) exp(λx − x²/(2σ²)) dx
               = e^{λ²σ²/2} ∫_{−∞}^{∞} (1/√(2πσ²)) exp(−(x − λσ²)²/(2σ²)) dx,

where the final integral is 1, because it is simply the integral of a Gaussian density.


As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3),
we see that for X Gaussian with variance σ², we have

    P(X ≥ E[X] + t) ≤ exp(−t²/(2σ²))   and   P(X ≤ E[X] − t) ≤ exp(−t²/(2σ²))

for all t ≥ 0. Indeed, we have log ϕ_{X−E[X]}(λ) = λ²σ²/2, and inf_λ {λ²σ²/2 − λt} = −t²/(2σ²), which is
attained by λ = t/σ². 3
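To get a feel for the relative strength of the three bounds, the following Python sketch (illustrative; it applies Markov's inequality to |X| and uses E|X| = √(2/π) for a standard Gaussian) compares the Markov, Chebyshev, and Chernoff bounds against the exact N(0, 1) tail at t = 2:

    import numpy as np
    from scipy import stats

    t = 2.0
    exact = stats.norm.sf(t)            # P(X >= t) for X ~ N(0, 1)
    markov = np.sqrt(2 / np.pi) / t     # E|X|/t, applying Markov to |X|
    chebyshev = 1.0 / t ** 2            # Var(X)/t^2
    chernoff = np.exp(-(t ** 2) / 2)    # inf over lambda of exp(lambda^2/2 - lambda t)
    print(exact, markov, chebyshev, chernoff)
    # roughly 0.023, 0.399, 0.25, 0.135: each bound is progressively sharper here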

4.1.1 Sub-Gaussian random variables


Gaussian random variables are convenient for their nice analytical properties, but a broader class
of random variables with similar moment generating functions are known as sub-Gaussian random
variables.
Definition 4.1. A random variable X is sub-Gaussian with parameter σ² if

    E[exp(λ(X − E[X]))] ≤ exp(λ²σ²/2)

for all λ ∈ R. We also say such a random variable is σ²-sub-Gaussian.
Of course, Gaussian random variables satisfy Definition 4.1 with equality. This would be un-
interesting if only Gaussian random variables satisfied this property; happily, that is not the case,
and we detail several examples.

Example 4.1.5 (Random signs (Rademacher variables)): The random variable X taking
values in {−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have

    E[exp(λX)] = (1/2)e^λ + (1/2)e^{−λ} = (1/2)∑_{k=0}^∞ λ^k/k! + (1/2)∑_{k=0}^∞ (−λ)^k/k!
               = ∑_{k=0}^∞ λ^{2k}/(2k)! ≤ ∑_{k=0}^∞ (λ²)^k/(2^k k!) = exp(λ²/2),

as claimed. 3

Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 4.1.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b].
Then Hoeffding’s lemma states that

    E[e^{λ(X−E[X])}] ≤ exp(λ²(b − a)²/8),

so that X is (b − a)²/4-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives one
approach to proving the above statement. First, let ε ∈ {−1, 1} be a Rademacher variable,
so that P(ε = 1) = P(ε = −1) = 1/2. We apply a so-called symmetrization technique—a
common technique in probability theory, statistics, concentration inequalities, and Banach
space research—to give a simpler bound. Indeed, let X′ be an independent copy of X, so that
E[X′] = E[X]. We have

    ϕ_{X−E[X]}(λ) = E[exp(λ(X − E[X′]))] ≤ E[exp(λ(X − X′))] = E[exp(λε(X − X′))],

where the inequality follows from Jensen’s inequality and the last equality is a consequence of
the fact that X − X′ is symmetric about 0. Using the result of Example 4.1.5,

    E[exp(λε(X − X′))] ≤ E[exp(λ²(X − X′)²/2)] ≤ exp(λ²(b − a)²/2),

where the final inequality is immediate from the fact that |X − X′| ≤ b − a. 3
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior,
more sophisticated techniques involve explicitly bounding the logarithm of the moment generating
function of X, often by calculations involving exponential tilts of its density. In particular, letting
X be mean zero for simplicity, if we let

    ψ(λ) = log ϕX (λ) = log E[e^{λX}],

then

    ψ′(λ) = E[Xe^{λX}] / E[e^{λX}]   and   ψ′′(λ) = E[X²e^{λX}] / E[e^{λX}] − E[Xe^{λX}]² / E[e^{λX}]²,

where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite.
Notably, if X has density pX (with respect to any base measure) then the random variable Yλ with
density

    pλ (y) = pX (y) e^{λy} / E[e^{λX}]

(with respect to the same base measure) satisfies

    ψ′(λ) = E[Yλ ]   and   ψ′′(λ) = E[Yλ²] − E[Yλ ]² = Var(Yλ ).

One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let X be Bernoulli(p), so that X = 1 with
probability p and X = 0 otherwise. Then a strengthening of Hoeffding’s lemma (also, essen-
tially, due to Hoeffding) is that

    log E[e^{λ(X−p)}] ≤ σ²(p) λ²/2   for σ²(p) := (1 − 2p) / (2 log((1 − p)/p)).

Here we take limits as p → {0, 1/2, 1}, so that σ²(0) = 0, σ²(1) = 0, and σ²(1/2) = 1/4.
Because p ↦ σ²(p) is concave and symmetric about p = 1/2, this inequality is always sharper
than that of Example 4.1.6. Exercise 4.9 gives one proof of this bound exploiting exponential
tilting. 3
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let X be σ²-sub-Gaussian. Then for all t ≥ 0 we have

    P(X − E[X] ≥ t) ∨ P(X − E[X] ≤ −t) ≤ exp(−t²/(2σ²)).

Chernoff bounds extend naturally to sums of independent random variables, because moment
generating functions of sums of independent random variables become products of moment gener-
ating functions.

Proposition 4.1.9. Let X1 , X2 , . . . , Xn be independent σi²-sub-Gaussian random variables. Then

    E[exp(λ ∑_{i=1}^n (Xi − E[Xi ]))] ≤ exp(λ² ∑_{i=1}^n σi² / 2)   for all λ ∈ R,

that is, ∑_{i=1}^n Xi is ∑_{i=1}^n σi²-sub-Gaussian.

Proof   We assume w.l.o.g. that the Xi are mean zero. We have by independence and
sub-Gaussianity that

    E[exp(λ ∑_{i=1}^n Xi )] = E[exp(λ ∑_{i=1}^{n−1} Xi )] E[exp(λXn )] ≤ exp(λ²σn²/2) E[exp(λ ∑_{i=1}^{n−1} Xi )].

Applying this technique inductively to Xn−1 , . . . , X1 , we obtain the desired result.

Two immediate corollaries to Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random
variables concentrate around their expectations. We begin with a general concentration inequality.

Corollary 4.1.10. Let Xi be independent σi²-sub-Gaussian random variables. Then for all t ≥ 0,

    max{P(∑_{i=1}^n (Xi − E[Xi ]) ≥ t), P(∑_{i=1}^n (Xi − E[Xi ]) ≤ −t)} ≤ exp(−t²/(2 ∑_{i=1}^n σi²)).

Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with Corol-
lary 4.1.10: if Xi ∈ [ai , bi ], then

    P(∑_{i=1}^n (Xi − E[Xi ]) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (bi − ai )²).

To give another interpretation of these inequalities, let us assume that the Xi are independent and
σ²-sub-Gaussian. Then we have that

    P((1/n) ∑_{i=1}^n (Xi − E[Xi ]) ≥ t) ≤ exp(−nt²/(2σ²)),

or, for δ ∈ (0, 1), setting exp(−nt²/(2σ²)) = δ, that is, t = √(2σ² log(1/δ))/√n, we have that

    (1/n) ∑_{i=1}^n (Xi − E[Xi ]) ≤ √(2σ² log(1/δ)/n)   with probability at least 1 − δ.
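The following Python sketch (a Monte Carlo illustration with Xi ∼ Uniform[0, 1], for which Hoeffding's lemma gives σ² = 1/4) checks that the empirical probability of exceeding this deviation bound is below δ—in fact well below it, since the sub-Gaussian constant from boundedness alone is conservative for the uniform distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, delta, sigma2 = 1000, 20_000, 0.05, 0.25
    devs = rng.uniform(0, 1, size=(trials, n)).mean(axis=1) - 0.5  # mean deviations
    bound = np.sqrt(2 * sigma2 * np.log(1 / delta) / n)
    print((devs > bound).mean(), "<=", delta)  # empirical failure rate vs. delta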

There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.


Theorem 4.1.11. Let X be a mean-zero random variable and σ 2 ≥ 0 be a constant. The following
statements are all equivalent, meaning that there are numerical constant factors Kj such that if one
statement (i) holds with parameter Ki , then statement (j) holds with parameter Kj ≤ CKi , where
C is a numerical constant.
(1) Sub-gaussian tails: P(|X| ≥ t) ≤ 2 exp(−t²/(K1 σ²)) for all t ≥ 0.

(2) Sub-gaussian moments: E[|X|^k]^{1/k} ≤ K2 σ√k for all k.

(3) Super-exponential moment: E[exp(X²/(K3 σ²))] ≤ e.

(4) Sub-gaussian moment generating function: E[exp(λX)] ≤ exp(K4 λ²σ²) for all λ ∈ R.

Particularly, (1) implies (2) with K1 = 1 and K2 ≤ e^{1/e}; (2) implies (3) with K2 = 1 and
K3 = e√(2/(e − 1)) < 3; (3) implies (4) with K3 = 1 and K4 ≤ 4/3; and (4) implies (1) with K4 = 1/2 and
K1 ≤ 2.

This result is standard in the literature on concentration and random variables, but see Ap-
pendix 4.5.1 for a proof of this theorem.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the ex-
ample, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6).

Example 4.1.12 (Sub-Gaussian squares): Let X be a mean-zero σ²-sub-Gaussian random
variable. Then

    E[exp(λX²)] ≤ 1 / [1 − 2σ²λ]_+^{1/2},                                  (4.1.2)

and expression (4.1.2) holds with equality for X ∼ N(0, σ²).
To see this result, we focus on the Gaussian case first and assume (for this case) without loss
of generality (by scaling) that σ² = 1. Assuming that λ < 1/2, we have

    E[exp(λZ²)] = ∫ (1/√(2π)) e^{−(1/2−λ)z²} dz = ∫ (1/√(2π)) e^{−((1−2λ)/2)z²} dz = 1/√(1 − 2λ),

the final equality a consequence of the fact that (as we know for normal random variables)
∫ e^{−z²/(2σ²)} dz = √(2πσ²). When λ ≥ 1/2, the above integrals are all infinite, giving the equality in
expression (4.1.2).
For the more general inequality, we recall that if Z is an independent N(0, 1) random variable,
then E[exp(tZ)] = exp(t²/2), and so

    E[exp(λX²)] = E[exp(√(2λ) XZ)] ≤(i) E[exp(λσ²Z²)] =(ii) 1 / [1 − 2σ²λ]_+^{1/2},

where inequality (i) follows because X is sub-Gaussian, and equality (ii) because Z ∼ N(0, 1).
3


4.1.2 Sub-exponential random variables


A slightly weaker condition than sub-Gaussianity is for a random variable to be sub-exponential,
which—for a mean-zero random variable—means that its moment generating function exists in a
neighborhood of zero.
Definition 4.2. A random variable X is sub-exponential with parameters (τ², b) if for all λ such
that |λ| ≤ 1/b,

    E[e^{λ(X−E[X])}] ≤ exp(λ²τ²/2).
It is clear from Definition 4.2 that a σ 2 -sub-Gaussian random variable is (σ 2 , 0)-sub-exponential.
A variety of random variables are sub-exponential. As a first example, χ2 -random variables are
sub-exponential with constant values for τ and b:
Example 4.1.13: Let X = Z², where Z ∼ N(0, 1). We claim that

    E[exp(λ(X − E[X]))] ≤ exp(2λ²)   for λ ≤ 1/4.                          (4.1.3)

Indeed, for λ < 1/2 we have that

    E[exp(λ(Z² − E[Z²]))] = exp(−(1/2) log(1 − 2λ) − λ) ≤(i) exp(λ + 2λ² − λ) = exp(2λ²),

where inequality (i) holds for λ ≤ 1/4, because − log(1 − 2λ) ≤ 2λ + 4λ² for λ ≤ 1/4. 3
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that $X$ is a mean-zero random variable taking values in $[-b, b]$ with variance $\sigma^2 = \mathbb{E}[X^2]$ (note that we are guaranteed that $\sigma^2 \le b^2$ in this case). We claim that
$$\mathbb{E}[\exp(\lambda X)] \le \exp\left(\frac{3\lambda^2\sigma^2}{5}\right) \quad \text{for } |\lambda| \le \frac{1}{2b}. \qquad (4.1.4)$$
To see this, note first that for $k \ge 2$ we have $\mathbb{E}[|X|^k] \le \mathbb{E}[X^2 b^{k-2}] = \sigma^2 b^{k-2}$. Then by an expansion of the exponential, we find
$$\mathbb{E}[\exp(\lambda X)] = 1 + \mathbb{E}[\lambda X] + \frac{\lambda^2\mathbb{E}[X^2]}{2} + \sum_{k=3}^\infty \frac{\lambda^k\mathbb{E}[X^k]}{k!} \le 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k\sigma^2 b^{k-2}}{k!}$$
$$= 1 + \frac{\lambda^2\sigma^2}{2} + \lambda^2\sigma^2\sum_{k=1}^\infty \frac{(\lambda b)^k}{(k+2)!} \stackrel{(i)}{\le} 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{10},$$
inequality (i) holding for $\lambda \le \frac{1}{2b}$. Using that $1 + x \le e^x$ gives the result.
It is possible to give a slightly tighter result for $\lambda \ge 0$. In this case, we have the bound
$$\mathbb{E}[\exp(\lambda X)] \le 1 + \frac{\lambda^2\sigma^2}{2} + \lambda^2\sigma^2\sum_{k=3}^\infty \frac{\lambda^{k-2}b^{k-2}}{k!} = 1 + \frac{\sigma^2}{b^2}\left(e^{\lambda b} - 1 - \lambda b\right).$$


Then using that $1 + x \le e^x$, we obtain Bennett's moment generating inequality, which is that
$$\mathbb{E}[e^{\lambda X}] \le \exp\left(\frac{\sigma^2}{b^2}\left(e^{\lambda b} - 1 - \lambda b\right)\right) \quad \text{for } \lambda \ge 0. \qquad (4.1.5)$$
Inequality (4.1.5) always holds, and for $\lambda b$ near 0, we have $e^{\lambda b} - 1 - \lambda b \approx \frac{\lambda^2 b^2}{2}$. 3

In particular, if the variance $\sigma^2 \ll b^2$, the absolute bound on $X$, inequality (4.1.4) gives much tighter control on the moment generating function of $X$ than typical sub-Gaussian bounds based only on the fact that $X \in [-b, b]$ allow.
More broadly, we can show a result similar to Theorem 4.1.11.

Theorem 4.1.15. Let $X$ be a random variable and $\sigma \ge 0$. Then—in the sense of Theorem 4.1.11—the following statements are all equivalent for suitable numerical constants $K_1, \ldots, K_4$.

(1) Sub-exponential tails: $\mathbb{P}(|X| \ge t) \le 2\exp(-\frac{t}{K_1\sigma})$ for all $t \ge 0$.

(2) Sub-exponential moments: $\mathbb{E}[|X|^k]^{1/k} \le K_2\sigma k$ for all $k \ge 1$.

(3) Existence of moment generating function: $\mathbb{E}[\exp(X/(K_3\sigma))] \le e$.

(4) If, in addition, $\mathbb{E}[X] = 0$, then $\mathbb{E}[\exp(\lambda X)] \le \exp(K_4\lambda^2\sigma^2)$ for all $|\lambda| \le K_4'/\sigma$.

In particular, if (2) holds with $K_2 = 1$, then (4) holds with $K_4 = 2e^2$ and $K_4' = \frac{1}{2e}$.

The proof, which is similar to that for Theorem 4.1.11, is presented in Section 4.5.2.
While the concentration properties of sub-exponential random variables are not quite so nice
as those for sub-Gaussian random variables (recall Hoeffding’s inequality, Corollary 4.1.10), we
can give sharp tail bounds for sub-exponential random variables. We first give a simple bound on
deviation probabilities.

Proposition 4.1.16. Let $X$ be a mean-zero $(\tau^2, b)$-sub-exponential random variable. Then for all $t \ge 0$,
$$\mathbb{P}(X \ge t) \vee \mathbb{P}(X \le -t) \le \exp\left(-\frac{1}{2}\min\left\{\frac{t^2}{\tau^2}, \frac{t}{b}\right\}\right).$$
Proof The proof is an application of the Chernoff bound technique; we prove only the upper tail, as the lower tail is similar. We have
$$\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda t}} \stackrel{(i)}{\le} \exp\left(\frac{\lambda^2\tau^2}{2} - \lambda t\right),$$
inequality (i) holding for $|\lambda| \le 1/b$. To minimize the last term in $\lambda$, we take $\lambda = \min\{\frac{t}{\tau^2}, \frac{1}{b}\}$, which gives the result.

Comparing with sub-Gaussian random variables, which have $b = 0$, we see that Proposition 4.1.16 gives a similar result for small $t$—essentially the same concentration as for sub-Gaussian random variables—while for large $t$, the tails decrease only exponentially in $t$.
We can also give a tensorization identity similar to Proposition 4.1.9.


Proposition 4.1.17. Let $X_1, \ldots, X_n$ be independent mean-zero sub-exponential random variables, where $X_i$ is $(\sigma_i^2, b_i)$-sub-exponential. Then for any vector $a \in \mathbb{R}^n$, we have
$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n a_i X_i\right)\right] \le \exp\left(\frac{\lambda^2\sum_{i=1}^n a_i^2\sigma_i^2}{2}\right) \quad \text{for } |\lambda| \le \frac{1}{b_*},$$
where $b_* = \max_i b_i|a_i|$. That is, $\langle a, X\rangle$ is $(\sum_{i=1}^n a_i^2\sigma_i^2,\, b_*)$-sub-exponential.

Proof We apply an inductive technique similar to that used in the proof of Proposition 4.1.9. First, for any fixed $i$, we know that if $|\lambda| \le \frac{1}{b_i|a_i|}$, then $|a_i\lambda| \le \frac{1}{b_i}$ and so
$$\mathbb{E}[\exp(\lambda a_i X_i)] \le \exp\left(\frac{\lambda^2 a_i^2\sigma_i^2}{2}\right).$$
Now, we inductively apply the preceding inequality, which applies so long as $|\lambda| \le \frac{1}{b_i|a_i|}$ for all $i$. We have
$$\mathbb{E}\left[\exp\left(\lambda\sum_{i=1}^n a_i X_i\right)\right] = \prod_{i=1}^n \mathbb{E}[\exp(\lambda a_i X_i)] \le \prod_{i=1}^n \exp\left(\frac{\lambda^2 a_i^2\sigma_i^2}{2}\right),$$
which is our desired result.

As in the case of sub-Gaussian random variables, a combination of the tensorization property of Proposition 4.1.17—that the moment generating functions of sums of sub-exponential random variables are well-behaved—and the concentration inequality of Proposition 4.1.16 immediately yields the following Bernstein-type inequality. (See also Vershynin [170].)

Corollary 4.1.18. Let $X_1, \ldots, X_n$ be independent mean-zero $(\sigma_i^2, b_i)$-sub-exponential random variables (Definition 4.2). Define $b_* := \max_i b_i$. Then for all $t \ge 0$ and all vectors $a \in \mathbb{R}^n$, we have
$$\mathbb{P}\left(\sum_{i=1}^n a_i X_i \ge t\right) \vee \mathbb{P}\left(\sum_{i=1}^n a_i X_i \le -t\right) \le \exp\left(-\frac{1}{2}\min\left\{\frac{t^2}{\sum_{i=1}^n a_i^2\sigma_i^2}, \frac{t}{b_*\|a\|_\infty}\right\}\right).$$

It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for $\sigma^2$-sub-Gaussian random variables) that
$$\mathbb{P}\left(\sum_{i=1}^n a_i X_i \ge t\right) \le \exp\left(-\frac{t^2}{2\|a\|_2^2\sigma^2}\right),$$
so that for small $t$, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large $t$, the bound is weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian bounds. Indeed, suppose that the random variables $X_i$ are i.i.d., mean zero, and satisfy $X_i \in [-b, b]$ with probability 1, but have variance $\sigma^2 = \mathbb{E}[X_i^2] \le b^2$ as in Example 4.1.14. Then Corollary 4.1.18 implies that
$$\mathbb{P}\left(\sum_{i=1}^n a_i X_i \ge t\right) \le \exp\left(-\frac{1}{2}\min\left\{\frac{5t^2}{6\sigma^2\|a\|_2^2}, \frac{t}{2b\|a\|_\infty}\right\}\right). \qquad (4.1.6)$$


When applied to a standard mean with $a_i = \frac{1}{n}$ (and with the minor simplification that $\frac{1}{3} < \frac{5}{12}$), we obtain the bound that $\frac{1}{n}\sum_{i=1}^n X_i \le t$ with probability at least $1 - \exp(-n\min\{\frac{t^2}{3\sigma^2}, \frac{t}{4b}\})$. Written differently, we take $t = \max\{\sigma\sqrt{\frac{3\log\frac{1}{\delta}}{n}}, \frac{4b\log\frac{1}{\delta}}{n}\}$ to obtain
$$\frac{1}{n}\sum_{i=1}^n X_i \le \max\left\{\sigma\sqrt{\frac{3\log\frac{1}{\delta}}{n}}, \frac{4b\log\frac{1}{\delta}}{n}\right\} \quad \text{with probability } 1 - \delta.$$
The sharpest such bound possible via more naive Hoeffding-type bounds is $b\sqrt{2\log\frac{1}{\delta}}/\sqrt{n}$, which has substantially worse scaling.
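To see the gap concretely, the following sketch (illustrative parameter values, not from the text, assuming numpy) evaluates both deviation bounds for a mean of variables in $[-b, b]$ with $\sigma \ll b$:

```python
import numpy as np

# Bernstein-type vs. Hoeffding-type deviation bounds for a sample mean of
# n i.i.d. variables in [-b, b] with variance sigma^2 (illustrative values).
n, delta = 10_000, 1e-3
b, sigma = 1.0, 0.05
log1d = np.log(1 / delta)

bernstein = max(sigma * np.sqrt(3 * log1d / n), 4 * b * log1d / n)
hoeffding = b * np.sqrt(2 * log1d / n)
print(f"Bernstein-type deviation: {bernstein:.5f}")  # sigma/sqrt(n) regime
print(f"Hoeffding-type deviation: {hoeffding:.5f}")  # b/sqrt(n), much larger
```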

Further conditions and examples


There are a number of examples of and conditions sufficient for random variables to be sub-exponential. One common condition, the so-called Bernstein condition, controls the higher moments of a random variable $X$ by its variance. We say that $X$ satisfies the $b$-Bernstein condition if
$$\left|\mathbb{E}[(X - \mu)^k]\right| \le \frac{k!}{2}\sigma^2 b^{k-2} \quad \text{for } k = 3, 4, \ldots, \qquad (4.1.7)$$
where $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}(X) = \mathbb{E}[X^2] - \mu^2$. In this case, the following lemma controls the moment generating function of $X$. This result is essentially present in Theorem 4.1.15, but it provides somewhat tighter control with precise constants.

Lemma 4.1.19. Let $X$ be a random variable satisfying the Bernstein condition (4.1.7). Then
$$\mathbb{E}\left[e^{\lambda(X - \mu)}\right] \le \exp\left(\frac{\lambda^2\sigma^2}{2(1 - b|\lambda|)}\right) \quad \text{for } |\lambda| < \frac{1}{b}.$$
Said differently, a random variable satisfying the Bernstein condition (4.1.7) is $(2\sigma^2, 2b)$-sub-exponential.
Proof Without loss of generality we assume $\mu = 0$. We expand the moment generating function by noting that
$$\mathbb{E}[e^{\lambda X}] = 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k\mathbb{E}[X^k]}{k!} \stackrel{(i)}{\le} 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{2}\sum_{k=3}^\infty |\lambda b|^{k-2} = 1 + \frac{\lambda^2\sigma^2}{2}\cdot\frac{1}{[1 - b|\lambda|]_+},$$
where inequality (i) used the Bernstein condition (4.1.7). Noting that $1 + x \le e^x$ gives the result.

As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.

Proposition 4.1.20 (Bennett’s inequality). Let Xi be independent mean-zero P random variables


with Var(Xi ) = σi2 and |Xi | ≤ b. Then for h(t) := (1 + t) log(1 + t) − t and σ 2 := ni=1 σi2 , we have
n
!  2  
X σ bt
P Xi ≥ t ≤ exp − 2 h .
b σ2
i=1


Proof Using the standard Chernoff bound argument coupled with inequality (4.1.5), we see that
$$\mathbb{P}\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(\sum_{i=1}^n \frac{\sigma_i^2}{b^2}\left(e^{\lambda b} - 1 - \lambda b\right) - \lambda t\right).$$
Letting $h(t) = (1+t)\log(1+t) - t$ as in the statement of the proposition and $\sigma^2 = \sum_{i=1}^n \sigma_i^2$, we minimize over $\lambda \ge 0$, setting $\lambda = \frac{1}{b}\log(1 + \frac{bt}{\sigma^2})$. Substituting into our Chernoff bound application gives the proposition.

A slightly more intuitive writing of Bennett's inequality is to use averages, in which case for $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ the average of the variances,
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{n\sigma^2}{b^2}\,h\left(\frac{bt}{\sigma^2}\right)\right).$$
It is possible to show that
$$\frac{n\sigma^2}{b^2}\,h\left(\frac{bt}{\sigma^2}\right) \ge \frac{nt^2}{2\sigma^2 + \frac{2}{3}bt},$$
which gives rise to the classical Bernstein inequality that
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{nt^2}{2\sigma^2 + \frac{2}{3}bt}\right). \qquad (4.1.8)$$
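The domination of the Bernstein exponent by Bennett's is easy to check numerically; a minimal sketch (assuming numpy):

```python
import numpy as np

# Verify (n sigma^2 / b^2) h(bt/sigma^2) >= n t^2 / (2 sigma^2 + (2/3) b t)
# on a grid, for h(u) = (1+u) log(1+u) - u (here with sigma^2 = b = n = 1).
def h(u):
    return (1 + u) * np.log1p(u) - u

t = np.linspace(0.0, 20.0, 2001)
bennett_exponent = h(t)                        # sigma^2 = b = 1
bernstein_exponent = t**2 / (2 + (2 / 3) * t)
assert np.all(bennett_exponent + 1e-12 >= bernstein_exponent)
print("Bennett's exponent dominates Bernstein's on the entire grid.")
```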

4.1.3 Orlicz norms


Sub-Gaussian and sub-exponential random variables are examples of a broader class of random variables belonging to what are known as Orlicz spaces. For these, we take any convex function $\psi: \mathbb{R}_+ \to \mathbb{R}_+$ with $\psi(0) = 0$ and $\psi(t) \to \infty$ as $t \uparrow \infty$, a class called the Orlicz functions. Then the Orlicz norm of a random variable $X$ is
$$\|X\|_\psi := \inf\{t > 0 \mid \mathbb{E}[\psi(|X|/t)] \le 1\}. \qquad (4.1.9)$$
That this is a norm is not completely trivial, though a few properties are immediate: clearly $\|aX\|_\psi = |a|\|X\|_\psi$, and we have $\|X\|_\psi = 0$ if and only if $X = 0$ with probability 1. The key result is that $\|\cdot\|_\psi$ is in fact convex, which then guarantees that it is a norm.

Proposition 4.1.21. The function $\|\cdot\|_\psi$ is convex on the space of random variables.

Proof Because $\psi$ is convex and non-decreasing, $x \mapsto \psi(|x|)$ is convex as well. (Convince yourself of this.) Thus, its perspective transform $\mathrm{pers}(\psi)(t, |x|) := t\psi(|x|/t)$ is jointly convex in both $t \ge 0$ and $x$ (see Appendix B.3.3). This joint convexity implies that for any random variables $X_0$ and $X_1$ and $t_0, t_1$,
$$\mathbb{E}[\mathrm{pers}(\psi)(\lambda t_0 + (1-\lambda)t_1, |\lambda X_0 + (1-\lambda)X_1|)] \le \lambda\,\mathbb{E}[\mathrm{pers}(\psi)(t_0, |X_0|)] + (1-\lambda)\,\mathbb{E}[\mathrm{pers}(\psi)(t_1, |X_1|)].$$
Now note that $\mathbb{E}[\psi(|X|/t)] \le 1$ if and only if $t\,\mathbb{E}[\psi(|X|/t)] = \mathbb{E}[\mathrm{pers}(\psi)(t, |X|)] \le t$; as $\|X\|_\psi$ is the infimum of $t$ over this jointly convex constraint, it is a partial minimization of a convex problem and hence convex.


Because $\|\cdot\|_\psi$ is convex and positively homogeneous, we certainly have
$$\|X + Y\|_\psi = 2\left\|\frac{X + Y}{2}\right\|_\psi \le \|X\|_\psi + \|Y\|_\psi,$$
that is, the triangle inequality holds.


We can recover several standard norms on random variables, including some we have already implicitly used. The first are the classical $L^p$ norms, obtained by taking $\psi(t) = t^p$, in which case
$$\inf\{t > 0 \mid \mathbb{E}[|X|^p/t^p] \le 1\} = \mathbb{E}[|X|^p]^{1/p}.$$

We also have what we term the sub-Gaussian and sub-exponential norms, defined via the functions
$$\psi_p(x) := \exp(|x|^p) - 1.$$
These induce the Orlicz $\psi_p$-norms, as for $p \ge 1$ these are convex (being the composition of the increasing convex function $\exp(\cdot)$ with the nonnegative convex function $|\cdot|^p$). Theorem 4.1.11 shows that we have a natural sub-Gaussian norm
$$\|X\|_{\psi_2} := \inf\{t > 0 \mid \mathbb{E}[\exp(X^2/t^2)] \le 2\}, \qquad (4.1.10)$$
while Theorem 4.1.15 shows a natural sub-exponential norm (or Orlicz $\psi_1$-norm)
$$\|X\|_{\psi_1} := \inf\{t > 0 \mid \mathbb{E}[\exp(|X|/t)] \le 2\}. \qquad (4.1.11)$$

Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example, any sub-Gaussian random variable (whether or not it is mean zero) has a square that is sub-exponential:

Lemma 4.1.22. A random variable $X$ is sub-Gaussian if and only if $X^2$ is sub-exponential, and moreover,
$$\|X\|_{\psi_2}^2 = \|X^2\|_{\psi_1}.$$
(This is immediate by definition.) By tracing through the arguments in the proofs of Theorems 4.1.11 and 4.1.15, we can also see that an alternative definition of the two norms could be
$$\sup_{k\in\mathbb{N}} \frac{1}{\sqrt{k}}\mathbb{E}[|X|^k]^{1/k} \quad \text{and} \quad \sup_{k\in\mathbb{N}} \frac{1}{k}\mathbb{E}[|X|^k]^{1/k}$$
for the sub-Gaussian and sub-exponential norms $\|X\|_{\psi_2}$ and $\|X\|_{\psi_1}$, respectively. They are all equivalent (up to numerical constants).
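Since $t \mapsto \mathbb{E}[\psi(|X|/t)]$ is nonincreasing, the infimum in (4.1.9) can be approximated from samples by bisection. A minimal sketch (assuming numpy; the function name is ours):

```python
import numpy as np

def orlicz_norm(x, psi, lo=0.5, hi=100.0, iters=60):
    """Estimate ||X||_psi = inf{t > 0 : E psi(|X|/t) <= 1} from samples x
    by bisecting on the nonincreasing map t -> E[psi(|x|/t)]."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric bisection
        if np.mean(psi(np.abs(x) / mid)) <= 1:
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
z = rng.normal(size=200_000)
psi2 = lambda u: np.expm1(u**2)  # psi_2(u) = e^{u^2} - 1
# For Z ~ N(0,1), E[exp(Z^2/t^2)] = 2 exactly at t^2 = 8/3, so ~1.633.
print(orlicz_norm(z, psi2))
```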

4.1.4 First applications of concentration: random projections


In this section, we investigate the use of concentration inequalities in random projections. As
motivation, consider nearest-neighbor (or k-nearest-neighbor) classification schemes. We have a
sequence of data points as pairs (ui , yi ), where the vectors ui ∈ Rd have labels yi ∈ {1, . . . , L},
where L is the number of possible labels. Given a new point u ∈ Rd that we wish to label, we find
the k-nearest neighbors to u in the sample {(ui , yi )}ni=1 , then assign u the majority label of these
k-nearest neighbors (ties are broken randomly). Unfortunately, it can be prohibitively expensive to
store high-dimensional vectors and search over large datasets to find near vectors; this has motivated
a line of work in computer science on fast methods for nearest neighbors based on reducing the dimension while preserving essential aspects of the dataset. This line of research begins with Indyk
and Motwani [112], and continuing through a variety of other works, including Indyk [111] and
work on locality-sensitive hashing by Andoni et al. [6], among others. The original approach is due
to Johnson and Lindenstrauss, who used the results in the study of Banach spaces [117]; our proof
follows a standard argument.
The most specific variant of this problem is as follows: we have $n$ points $u_1, \ldots, u_n$, and we would like to construct a mapping $\Phi: \mathbb{R}^d \to \mathbb{R}^m$, where $m \ll d$, such that
$$\|\Phi u_i - \Phi u_j\|_2 \in (1 \pm \epsilon)\|u_i - u_j\|_2.$$
Depending on the norm chosen, this task may be impossible; for the Euclidean ($\ell_2$) norm, however, such an embedding is easy to construct using Gaussian random variables and with $m = O(\frac{1}{\epsilon^2}\log n)$. This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size $m$ is independent of the dimension $d$, only depending on the number of points $n$.
Example 4.1.23 (Johnson-Lindenstrauss): Let the matrix $\Phi \in \mathbb{R}^{m\times d}$ be defined as follows:
$$\Phi_{ij} \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, 1/m),$$
and let $\Phi_i \in \mathbb{R}^d$ denote the $i$th row of this matrix. We claim that
$$m \ge \frac{8}{\epsilon^2}\left(2\log n + \log\frac{1}{\delta}\right) \quad \text{implies} \quad \|\Phi u_i - \Phi u_j\|_2^2 \in (1 \pm \epsilon)\|u_i - u_j\|_2^2$$
for all pairs $u_i, u_j$ with probability at least $1 - \delta$. In particular, $m \gtrsim \frac{\log n}{\epsilon^2}$ is sufficient to achieve accurate dimension reduction with high probability.

To see this, note that for any fixed vector $u$,
$$\frac{\langle\Phi_i, u\rangle}{\|u\|_2} \sim \mathsf{N}(0, 1/m), \quad \text{and} \quad \frac{\|\Phi u\|_2^2}{\|u\|_2^2} = \sum_{i=1}^m \langle\Phi_i, u/\|u\|_2\rangle^2$$
is a sum of independent scaled $\chi^2$-random variables. In particular, we have $\mathbb{E}[\|\Phi u/\|u\|_2\|_2^2] = 1$, and using the $\chi^2$-concentration result of Example 4.1.13 yields
$$\mathbb{P}\left(\left|\|\Phi u\|_2^2/\|u\|_2^2 - 1\right| \ge \epsilon\right) = \mathbb{P}\left(m\left|\|\Phi u\|_2^2/\|u\|_2^2 - 1\right| \ge m\epsilon\right) \le 2\inf_{|\lambda|\le\frac{1}{4}}\exp\left(2m\lambda^2 - \lambda m\epsilon\right) = 2\exp\left(-\frac{m\epsilon^2}{8}\right),$$
the last inequality holding for $\epsilon \in [0, 1]$. Now, using the union bound applied to each of the $\binom{n}{2}$ pairs $(u_i, u_j)$ in the sample, we have
$$\mathbb{P}\left(\text{there exist } i \ne j \text{ s.t. } \left|\|\Phi(u_i - u_j)\|_2^2 - \|u_i - u_j\|_2^2\right| \ge \epsilon\|u_i - u_j\|_2^2\right) \le 2\binom{n}{2}\exp\left(-\frac{m\epsilon^2}{8}\right).$$
Taking $m \ge \frac{8}{\epsilon^2}\log\frac{n^2}{\delta} = \frac{16}{\epsilon^2}\log n + \frac{8}{\epsilon^2}\log\frac{1}{\delta}$ yields that with probability at least $1 - \delta$, we have $\|\Phi u_i - \Phi u_j\|_2^2 \in (1 \pm \epsilon)\|u_i - u_j\|_2^2$. 3
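The construction is simple enough to test directly. A minimal sketch of the embedding (assuming numpy; sizes are illustrative):

```python
import numpy as np

# Johnson-Lindenstrauss sketch: project n points from R^d to R^m with
# Phi_ij ~ N(0, 1/m) i.i.d. and measure the worst pairwise distortion.
rng = np.random.default_rng(1)
n, d, eps, delta = 50, 10_000, 0.25, 0.1
m = int(np.ceil(8 / eps**2 * (2 * np.log(n) + np.log(1 / delta))))

U = rng.normal(size=(n, d))
Phi = rng.normal(scale=1 / np.sqrt(m), size=(m, d))
V = U @ Phi.T

i, j = np.triu_indices(n, k=1)
orig = np.sum((U[i] - U[j]) ** 2, axis=1)
proj = np.sum((V[i] - V[j]) ** 2, axis=1)
print(f"m = {m}, max |ratio - 1| = {np.abs(proj / orig - 1).max():.3f}")
```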

Computing low-dimensional embeddings of high-dimensional data is an area of active research,


and more recent work has shown how to achieve sharper constants [57] and how to use more struc-
tured matrices to allow substantially faster computation of the embeddings Φu (see, for example,
Achlioptas [1] for early work in this direction, and Ailon and Chazelle [3] for the so-called “Fast
Johnson-Lindenstrauss transform”).


4.1.5 A second application of concentration: codebook generation


We now consider a (very simplified and essentially un-implementable) view of encoding a signal for transmission and generating a codebook for transmitting said signal. Suppose that we have a set of words, or signals, that we wish to transmit; let us index them by $i \in \{1, \ldots, m\}$, so that there are $m$ total signals we wish to communicate across a binary symmetric channel $Q$, meaning that given an input bit $x \in \{0, 1\}$, $Q$ outputs a $z \in \{0, 1\}$ with $Q(Z = x \mid x) = 1 - \epsilon$ and $Q(Z = 1 - x \mid x) = \epsilon$, for some $\epsilon < \frac{1}{2}$. (For simplicity, we assume $Q$ is memoryless, meaning that when the channel is used multiple times on a sequence $x_1, \ldots, x_n$, its outputs $Z_1, \ldots, Z_n$ are conditionally independent: $Q(Z_{1:n} = z_{1:n} \mid x_{1:n}) = Q(Z_1 = z_1 \mid x_1)\cdots Q(Z_n = z_n \mid x_n)$.)

We consider a simplified block coding scheme, where for each $i$ we associate a codeword $x_i \in \{0, 1\}^d$, where $d$ is a dimension (block length) to be chosen. Upon sending the codeword over the channel, and receiving some $z^{\mathrm{rec}} \in \{0, 1\}^d$, we decode by choosing
$$i^* \in \mathop{\mathrm{argmax}}_{i\in[m]} Q(Z = z^{\mathrm{rec}} \mid x_i) = \mathop{\mathrm{argmin}}_{i\in[m]} \|z^{\mathrm{rec}} - x_i\|_1, \qquad (4.1.12)$$

the maximum likelihood decoder. We now investigate how to choose a collection $\{x_1, \ldots, x_m\}$ of such codewords and give finite sample bounds on its probability of error. In fact, by using concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension is likely to enjoy good performance.

Intuitively, if our codebook $\{x_1, \ldots, x_m\} \subset \{0,1\}^d$ is well-separated, meaning that each pair of words $x_i, x_k$ satisfies $\|x_i - x_k\|_1 \ge cd$ for some numerical constant $c > 0$, we should be unlikely to make a mistake. Let us make this precise. We mistake word $i$ for word $k$ only if the received signal $Z$ satisfies $\|Z - x_i\|_1 \ge \|Z - x_k\|_1$, and letting $J = \{j \in [d] : x_{ij} \ne x_{kj}\}$ denote the set of at least $c\cdot d$ indices where $x_i$ and $x_k$ differ, we have
$$\|Z - x_i\|_1 \ge \|Z - x_k\|_1 \quad \text{if and only if} \quad \sum_{j\in J}\left(|Z_j - x_{ij}| - |Z_j - x_{kj}|\right) \ge 0.$$
If $x_i$ is the word being sent and $x_i$ and $x_k$ differ in position $j$, then $|Z_j - x_{ij}| - |Z_j - x_{kj}| \in \{-1, 1\}$, and is equal to $-1$ with probability $(1 - \epsilon)$ and $1$ with probability $\epsilon$. That is, we have $\|Z - x_i\|_1 \ge \|Z - x_k\|_1$ if and only if
$$\sum_{j\in J}\left(|Z_j - x_{ij}| - |Z_j - x_{kj}|\right) + |J|(1 - 2\epsilon) \ge |J|(1 - 2\epsilon) \ge cd(1 - 2\epsilon),$$
and the expectation $\mathbb{E}_Q[|Z_j - x_{ij}| - |Z_j - x_{kj}| \mid x_i] = -(1 - 2\epsilon)$ when $x_{ij} \ne x_{kj}$. Using the Hoeffding bound, then, we have
$$Q(\|Z - x_i\|_1 \ge \|Z - x_k\|_1 \mid x_i) \le \exp\left(-\frac{|J|(1 - 2\epsilon)^2}{2}\right) \le \exp\left(-\frac{cd(1 - 2\epsilon)^2}{2}\right),$$
where we have used that there are at least $|J| \ge cd$ indices differing between $x_i$ and $x_k$. The probability of making a mistake at all is thus at most $m\exp(-\frac{1}{2}cd(1 - 2\epsilon)^2)$ if our codebook has separation $c\cdot d$.
separation c · d.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose
a set of code words {x1 , . . . , xm } that is well separated. To that end, we state a simple lemma.


Lemma 4.1.24. Let $X_i$, $i = 1, \ldots, m$, be drawn independently and uniformly on the $d$-dimensional hypercube $H_d := \{0, 1\}^d$. Then for any $t \ge 0$,
$$\mathbb{P}\left(\exists~i, j \text{ s.t. } \|X_i - X_j\|_1 < \frac{d}{2} - dt\right) \le \binom{m}{2}\exp\left(-2dt^2\right) \le \frac{m^2}{2}\exp\left(-2dt^2\right).$$
Proof First, let us consider two independent draws $X$ and $X'$ uniformly on the hypercube. Let $Z = \sum_{j=1}^d \mathbf{1}\{X_j \ne X_j'\} = d_{\mathrm{ham}}(X, X') = \|X - X'\|_1$. Then $\mathbb{E}[Z] = \frac{d}{2}$. Moreover, $Z$ is an i.i.d. sum of Bernoulli $\frac{1}{2}$ random variables, so that by our concentration bounds of Corollary 4.1.10, we have
$$\mathbb{P}\left(\|X - X'\|_1 \le \frac{d}{2} - t\right) \le \exp\left(-\frac{2t^2}{d}\right).$$
Using a union bound gives the remainder of the result.

Rewriting the lemma slightly, we may take $\delta \in (0, 1)$. Then
$$\mathbb{P}\left(\exists~i, j \text{ s.t. } \|X_i - X_j\|_1 < \frac{d}{2} - \sqrt{d\log\frac{1}{\delta} + d\log m}\right) \le \delta.$$
As a consequence of this lemma, we see two things:

(i) If $m \le \exp(d/16)$, or $d \ge 16\log m$, then taking $\delta \uparrow 1$, there at least exists a codebook $\{x_1, \ldots, x_m\}$ of words that are all separated by at least $d/4$, that is, $\|x_i - x_j\|_1 \ge \frac{d}{4}$ for all $i, j$.

(ii) By taking $m \le \exp(d/32)$, or $d \ge 32\log m$, and $\delta = e^{-d/32}$, then with probability at least $1 - e^{-d/32}$—exponentially close to 1 in $d$—a randomly drawn codebook has all its entries separated by at least $\|x_i - x_j\|_1 \ge \frac{d}{4}$.

Summarizing, we have the following result: choose a codebook of $m$ codewords $x_1, \ldots, x_m$ uniformly at random from the hypercube $H_d = \{0, 1\}^d$ with
$$d \ge \max\left\{32\log m, \frac{8\log\frac{m}{\delta}}{(1 - 2\epsilon)^2}\right\}.$$
Then with probability at least $1 - 1/m$ over the draw of the codebook, the probability we make a mistake in transmission of any given symbol $i$ over the channel $Q$ is at most $\delta$.
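A small simulation makes the argument concrete; the sketch below (assuming numpy, with illustrative parameters) draws a random codebook and runs the maximum likelihood decoder (4.1.12):

```python
import numpy as np

# Random codebook over a binary symmetric channel: draw m codewords
# uniformly from {0,1}^d with d as in the summary above, then decode
# by minimizing Hamming distance and estimate the error rate.
rng = np.random.default_rng(2)
m, eps, delta = 64, 0.1, 0.01
d = int(max(32 * np.log(m), 8 * np.log(m / delta) / (1 - 2 * eps) ** 2))

codebook = rng.integers(0, 2, size=(m, d))
errors, trials = 0, 1000
for _ in range(trials):
    i = rng.integers(m)
    z = codebook[i] ^ (rng.random(d) < eps)            # flips bits w.p. eps
    decoded = np.argmin(np.abs(z - codebook).sum(axis=1))
    errors += int(decoded != i)
print(f"d = {d}, empirical error rate = {errors / trials:.4f}")
```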

4.2 Martingale methods


The next set of tools we consider constitute our first look at argument sbased on stability, that is,
how quantities that do not change very much when a single observation changes should concentrate.
In this case, we would like to understand more general quantities than sample means, developing a
few of the basic cools to understand when functions f (X1 , . . . , Xn ) of independent random variables
Xi concentrate around their expectations. Roughly, we expect that if changing the value of one xi
does not significantly change f (xn1 ) much—it is stable—then it should exhibit good concentration
properties.
To develop the tools to do this, we go throuhg an approach based on martingales, a deep subject
in probability theory. We give a high-level treatment of martingales, taking an approach that does
not require measure-theoretic considerations, providing references at the end of the chapter. We
begin by providing a definition.


Definition 4.3. Let $M_1, M_2, \ldots$ be an $\mathbb{R}$-valued sequence of random variables. They are a martingale if there exist another sequence of random variables $\{Z_1, Z_2, \ldots\} \subset \mathcal{Z}$ and a sequence of functions $f_n: \mathcal{Z}^n \to \mathbb{R}$ such that
$$\mathbb{E}[M_n \mid Z_1^{n-1}] = M_{n-1} \quad \text{and} \quad M_n = f_n(Z_1^n)$$
for all $n \in \mathbb{N}$. We say that the sequence $M_n$ is adapted to $\{Z_n\}$.

In general, the sequence $Z_1, Z_2, \ldots$ is a sequence of increasing $\sigma$-fields $\mathcal{F}_1, \mathcal{F}_2, \ldots$, and $M_n$ is $\mathcal{F}_n$-measurable, but Definition 4.3 is sufficient for our purposes. We also will find it convenient to study differences of martingales, so we make the following definition.

Definition 4.4. Let $D_1, D_2, \ldots$ be a sequence of random variables. They form a martingale difference sequence if $M_n := \sum_{i=1}^n D_i$ is a martingale.

Equivalently, there is a sequence of random variables $Z_n$ and functions $g_n: \mathcal{Z}^n \to \mathbb{R}$ such that
$$\mathbb{E}[D_n \mid Z_1^{n-1}] = 0 \quad \text{and} \quad D_n = g_n(Z_1^n)$$
for all $n \in \mathbb{N}$.
There are numerous examples of martingale sequences. The classical one is the symmetric random walk.

Example 4.2.1: Let $D_n \in \{\pm 1\}$ be uniform and independent. Then the $D_n$ form a martingale difference sequence adapted to themselves (that is, we may take $Z_n = D_n$), and $M_n = \sum_{i=1}^n D_i$ is a martingale. 3

A more sophisticated example, to which we will frequently return and that suggests the potential usefulness of martingale constructions, is the Doob martingale associated with a function $f$.

Example 4.2.2 (Doob martingales): Let $f: \mathcal{X}^n \to \mathbb{R}$ be an otherwise arbitrary function, and let $X_1, \ldots, X_n$ be arbitrary random variables. The Doob martingale is defined by the difference sequence
$$D_i := \mathbb{E}[f(X_1^n) \mid X_1^i] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}].$$
By inspection, the $D_i$ are functions of $X_1^i$, and we have
$$\mathbb{E}[D_i \mid X_1^{i-1}] = \mathbb{E}\left[\mathbb{E}[f(X_1^n) \mid X_1^i] \mid X_1^{i-1}\right] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}] = \mathbb{E}[f(X_1^n) \mid X_1^{i-1}] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}] = 0$$
by the tower property of expectations. Thus, the $D_i$ satisfy Definition 4.4 of a martingale difference sequence, and moreover, we have
$$\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)],$$
and so the Doob martingale captures exactly the difference between $f$ and its expectation. 3
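For intuition, one can simulate a Doob martingale and watch it interpolate between $\mathbb{E}[f]$ and $f(X_1^n)$. A minimal sketch for $f(x) = \max_i x_i$ with uniform coordinates (conditional expectations estimated by Monte Carlo; all names here are ours):

```python
import numpy as np

# Doob martingale M_k = E[f(X_1^n) | X_1^k] for f(x) = max_i x_i and
# X_i ~ Uniform[0,1]; M_0 = E[f] and M_n = f(x) exactly.
rng = np.random.default_rng(3)
n, reps = 10, 20_000
x = rng.random(n)

def cond_exp(prefix):
    """Monte Carlo estimate of E[max(X_1^n) | X_1^k = prefix]."""
    k = len(prefix)
    running_max = np.max(prefix) if k > 0 else 0.0
    if k == n:
        return float(running_max)
    tails = rng.random((reps, n - k)).max(axis=1)
    return float(np.mean(np.maximum(running_max, tails)))

M = [cond_exp(x[:k]) for k in range(n + 1)]
print(np.round(M, 3))  # starts near n/(n+1) ~ 0.909, ends at max(x)
```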


4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities

With these motivating ideas introduced, we turn to definitions, providing generalizations of our concentration inequalities for sub-Gaussian sums to sub-Gaussian martingales, which we define.

Definition 4.5. Let $\{D_n\}$ be a martingale difference sequence adapted to $\{Z_n\}$. Then $D_n$ is a $\sigma_n^2$-sub-Gaussian martingale difference if
$$\mathbb{E}[\exp(\lambda D_n) \mid Z_1^{n-1}] \le \exp\left(\frac{\lambda^2\sigma_n^2}{2}\right)$$
for all $n$ and $\lambda \in \mathbb{R}$.

Immediately from the definition, we have the Azuma-Hoeffding inequalities, which generalize the earlier tensorization identities for sub-Gaussian random variables.

Theorem 4.2.3 (Azuma-Hoeffding). Let $\{D_n\}$ be a $\sigma_n^2$-sub-Gaussian martingale difference sequence. Then $M_n = \sum_{i=1}^n D_i$ is $\sum_{i=1}^n \sigma_i^2$-sub-Gaussian, and moreover,
$$\max\{\mathbb{P}(M_n \ge t), \mathbb{P}(M_n \le -t)\} \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n \sigma_i^2}\right) \quad \text{for all } t \ge 0.$$
Proof The proof is essentially immediate: letting $Z_n$ be the sequence to which the $D_n$ are adapted, we write
$$\mathbb{E}[\exp(\lambda M_n)] = \mathbb{E}\left[\prod_{i=1}^n e^{\lambda D_i}\right] = \mathbb{E}\left[\mathbb{E}\left[\prod_{i=1}^n e^{\lambda D_i} \,\Big|\, Z_1^{n-1}\right]\right] = \mathbb{E}\left[\prod_{i=1}^{n-1} e^{\lambda D_i}\cdot\mathbb{E}[e^{\lambda D_n} \mid Z_1^{n-1}]\right]$$
because $D_1, \ldots, D_{n-1}$ are functions of $Z_1^{n-1}$. Then we use Definition 4.5, which implies that $\mathbb{E}[e^{\lambda D_n} \mid Z_1^{n-1}] \le e^{\lambda^2\sigma_n^2/2}$, and we obtain
$$\mathbb{E}[\exp(\lambda M_n)] \le \mathbb{E}\left[\prod_{i=1}^{n-1} e^{\lambda D_i}\right]\exp\left(\frac{\lambda^2\sigma_n^2}{2}\right).$$
Repeating the same argument for $n-1, n-2, \ldots, 1$ gives that
$$\log\mathbb{E}[\exp(\lambda M_n)] \le \frac{\lambda^2}{2}\sum_{i=1}^n \sigma_i^2,$$
as desired. The second claims are simply applications of Chernoff bounds via Proposition 4.1.8 and the fact that $\mathbb{E}[M_n] = 0$.

As an immediate corollary, we recover Proposition 4.1.9, as sums of independent random variables form martingales via $M_n = \sum_{i=1}^n (X_i - \mathbb{E}[X_i])$. A second corollary gives what is typically termed the Azuma inequality:


Corollary 4.2.4. Let $D_i$ be a bounded martingale difference sequence, meaning that $|D_i| \le c$. Then $M_n = \sum_{i=1}^n D_i$ satisfies
$$\mathbb{P}(n^{-1/2}M_n \ge t) \vee \mathbb{P}(n^{-1/2}M_n \le -t) \le \exp\left(-\frac{t^2}{2c^2}\right) \quad \text{for } t \ge 0.$$
Thus, bounded random walks are (with high probability) within $\pm\sqrt{n}$ of their expectations after $n$ steps.
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [87].

4.2.2 Examples and bounded differences


We now develop several example applications of the Azuma-Hoeffding inequalities (Theorem 4.2.3), applying them most specifically to functions satisfying certain stability conditions. We first define the collections of functions we consider.

Definition 4.6 (Bounded differences). Let $f: \mathcal{X}^n \to \mathbb{R}$ for some space $\mathcal{X}$. Then $f$ satisfies bounded differences with constants $c_i$ if for each $i \in \{1, \ldots, n\}$, all $x_1^n \in \mathcal{X}^n$, and $x_i' \in \mathcal{X}$ we have
$$|f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n)| \le c_i.$$

The classical inequality relating bounded differences and concentration is McDiarmid's inequality, or the bounded differences inequality.

Proposition 4.2.5 (Bounded differences inequality). Let $f: \mathcal{X}^n \to \mathbb{R}$ satisfy bounded differences with constants $c_i$, and let $X_i$ be independent random variables. Then $f(X_1^n) - \mathbb{E}[f(X_1^n)]$ is $\frac{1}{4}\sum_{i=1}^n c_i^2$-sub-Gaussian, and
$$\mathbb{P}(f(X_1^n) - \mathbb{E}[f(X_1^n)] \ge t) \vee \mathbb{P}(f(X_1^n) - \mathbb{E}[f(X_1^n)] \le -t) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

Proof The basic idea is to show that the Doob martingale (Example 4.2.2) associated with $f$ is $c_i^2/4$-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define $D_i = \mathbb{E}[f(X_1^n) \mid X_1^i] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}]$ as before, and note that $\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)]$. The random variables
$$L_i := \inf_x \mathbb{E}[f(X_1^n) \mid X_1^{i-1}, X_i = x] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}], \qquad U_i := \sup_x \mathbb{E}[f(X_1^n) \mid X_1^{i-1}, X_i = x] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}]$$
evidently satisfy $L_i \le D_i \le U_i$, and moreover, we have
$$U_i - L_i \le \sup_{x_1^{i-1}}\sup_{x, x'}\left\{\mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x']\right\}$$
$$= \sup_{x_1^{i-1}}\sup_{x, x'}\int\left(f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)\right)dP(x_{i+1}^n) \le c_i,$$
where we have used the independence of the $X_i$ and Definition 4.6 of bounded differences. Consequently, we have by Hoeffding's Lemma (Example 4.1.6) that $\mathbb{E}[e^{\lambda D_i} \mid X_1^{i-1}] \le \exp(\lambda^2 c_i^2/8)$, that is, the Doob martingale is $c_i^2/4$-sub-Gaussian. The remainder of the proof is simply Theorem 4.2.3.

A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we will revisit them more later.
Example 4.2.6 (Bounded random vectors): Let $\mathcal{B}$ be a Banach space—a complete normed vector space—with norm $\|\cdot\|$. Let $X_i$ be independent bounded random vectors in $\mathcal{B}$ satisfying $\mathbb{E}[X_i] = 0$ and $\|X_i\| \le c$. We claim that the quantity
$$f(X_1^n) := \left\|\frac{1}{n}\sum_{i=1}^n X_i\right\|$$
satisfies bounded differences. Indeed, we have by the triangle inequality that
$$|f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)| \le \frac{1}{n}\|x - x'\| \le \frac{2c}{n}.$$
Consequently, if the $X_i$ are independent, we have
$$\mathbb{P}\left(\left|\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\| - \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\|\right]\right| \ge t\right) \le 2\exp\left(-\frac{nt^2}{2c^2}\right) \qquad (4.2.1)$$
for all $t \ge 0$. That is, the norm of (bounded) random vectors in an essentially arbitrary vector space concentrates extremely quickly about its expectation.

The challenge becomes to control the expectation term in the concentration bound (4.2.1), which can be a bit challenging. In certain cases—for example, when we have a Euclidean structure on the vectors $X_i$—it can be easier. Indeed, let us specialize to the case that $X_i \in \mathcal{H}$, a (real) Hilbert space, so that there is an inner product $\langle\cdot,\cdot\rangle$ and the norm satisfies $\|x\|^2 = \langle x, x\rangle$ for $x \in \mathcal{H}$. Then Cauchy-Schwarz implies that
$$\mathbb{E}\left[\left\|\sum_{i=1}^n X_i\right\|\right]^2 \le \mathbb{E}\left[\left\|\sum_{i=1}^n X_i\right\|^2\right] = \sum_{i,j}\mathbb{E}[\langle X_i, X_j\rangle] = \sum_{i=1}^n \mathbb{E}[\|X_i\|^2].$$
That is, assuming the $X_i$ are independent and $\mathbb{E}[\|X_i\|^2] \le \sigma^2$, so that $\mathbb{E}[\|\overline{X}_n\|] \le \sigma/\sqrt{n}$, inequality (4.2.1) becomes
$$\mathbb{P}\left(\|\overline{X}_n\| \ge \frac{\sigma}{\sqrt{n}} + t\right) + \mathbb{P}\left(\|\overline{X}_n\| \le \mathbb{E}[\|\overline{X}_n\|] - t\right) \le 2\exp\left(-\frac{nt^2}{2c^2}\right),$$
where $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. 3

We can specialize Example 4.2.6 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of Example 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let $\mathcal{X}$ be some space, and let $\mathcal{F}$ be some collection of functions $f: \mathcal{X} \to \mathbb{R}$. Let $\varepsilon_i \in \{-1, 1\}$ be a collection of independent random signs. Then the empirical Rademacher complexity of $\mathcal{F}$ is
$$R_n(\mathcal{F} \mid x_1^n) := \mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i)\right],$$
where the expectation is over only the random signs $\varepsilon_i$. (In some cases, depending on context and convenience, one takes the absolute value $|\sum_i \varepsilon_i f(x_i)|$.) The Rademacher complexity of $\mathcal{F}$ is
$$R_n(\mathcal{F}) := \mathbb{E}[R_n(\mathcal{F} \mid X_1^n)],$$
the expectation of the empirical Rademacher complexities.

If $f: \mathcal{X} \to [b_0, b_1]$ for all $f \in \mathcal{F}$, then the empirical Rademacher complexity satisfies bounded differences, because for any two sequences $x_1^n$ and $z_1^n$ differing in only element $j$, we have
$$n|R_n(\mathcal{F} \mid x_1^n) - R_n(\mathcal{F} \mid z_1^n)| \le \mathbb{E}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^n \varepsilon_i(f(x_i) - f(z_i))\right] = \mathbb{E}\left[\sup_{f\in\mathcal{F}}\varepsilon_j(f(x_j) - f(z_j))\right] \le b_1 - b_0.$$
Consequently, $R_n(\mathcal{F} \mid X_1^n) - R_n(\mathcal{F})$ is $\frac{(b_1 - b_0)^2}{4n}$-sub-Gaussian by Theorem 4.2.3. 3
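For a finite class, the empirical Rademacher complexity can be estimated directly by Monte Carlo over the signs; a minimal sketch (assuming numpy, with a synthetic table of function values):

```python
import numpy as np

# Estimate R_n(F | x_1^n) = E[sup_{f in F} (1/n) sum_i eps_i f(x_i)] for a
# finite class F represented as an |F| x n matrix of values f(x_i).
rng = np.random.default_rng(4)
n, num_funcs, reps = 200, 50, 5_000
F = rng.uniform(-1, 1, size=(num_funcs, n))  # synthetic function values

eps = rng.choice([-1, 1], size=(reps, n))
sups = (eps @ F.T / n).max(axis=1)           # sup over F per sign draw
print("estimated R_n(F | x):", sups.mean())
# Massart's finite class bound (Proposition 4.3.4 below), sigma_n^2 <= 1 here:
print("Massart-style bound  :", np.sqrt(2 * np.log(num_funcs) / n))
```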

These examples warrant more discussion, and it is possible to argue that many variants of these random variables are well-concentrated. For example, instead of functions we may simply consider an arbitrary set $A \subset \mathbb{R}^n$ and define the random variable
$$Z(A) := \sup_{a\in A}\langle a, \varepsilon\rangle = \sup_{a\in A}\sum_{i=1}^n a_i\varepsilon_i.$$
As a function of the random signs $\varepsilon_i$, we may write $Z(A) = f(\varepsilon)$, and this is then a function satisfying $|f(\varepsilon) - f(\varepsilon')| \le \sup_{a\in A}|\langle a, \varepsilon - \varepsilon'\rangle|$, so that if $\varepsilon$ and $\varepsilon'$ differ in index $i$, we have $|f(\varepsilon) - f(\varepsilon')| \le 2\sup_{a\in A}|a_i|$. That is, $Z(A) - \mathbb{E}[Z(A)]$ is $\sum_{i=1}^n \sup_{a\in A}|a_i|^2$-sub-Gaussian.

Example 4.2.8 (Rademacher complexity as a random vector): This view of Rademacher complexity shows how we may think of Rademacher complexities as norms on certain spaces. Indeed, if we consider a vector space $L$ of linear functions on $\mathcal{F}$, then we can define the $\mathcal{F}$-seminorm on $L$ by $\|L\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}}|L(f)|$. In this case, we may consider the symmetrized empirical distributions
$$P_n^0 := \frac{1}{n}\sum_{i=1}^n \varepsilon_i\mathbf{1}_{X_i}, \qquad f \mapsto P_n^0 f := \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)$$
as elements of this vector space $L$. (Here we have used $\mathbf{1}_{X_i}$ to denote the point mass at $X_i$.) Then the Rademacher complexity is nothing more than the expected norm of $P_n^0$, a random vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general results we may prove about random vectors, as in Example 4.2.6, will carry over immediately to versions of the Rademacher complexity. 3

4.3 Uniformity and metric entropy


Now that we have explored a variety of concentration inequalities, we show how to put them to use
in demonstrating that a variety of estimation, learning, and other types of procedures have nice
convergence properties. We first give a somewhat general collection of results, then delve deeper
by focusing on some standard tasks from machine learning.


4.3.1 Symmetrization and uniform laws


The first set of results we consider are uniform laws of large numbers, where the goal is to bound
means uniformly over different classes of functions. Frequently, such results are called Glivenko-
Cantelli laws, after the original Glivenko-Cantelli theorem, which shows that empirical distributions
uniformly converge. We revisit these ideas in the next chapter, where we present a number of more
advanced techniques based on ideas of metric entropy (or volume-like considerations); here we
present the basic ideas using our stability and bounded differencing tools.
The starting point is to define what we mean by a uniform law of large numbers. To do so, we adopt notation (as in Example 4.2.8) we will use throughout the remainder of the book, reminding readers as we go. For a sample $X_1, \ldots, X_n$ on a space $\mathcal{X}$, we let
$$P_n := \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i}$$
denote the empirical distribution on $\{X_i\}_{i=1}^n$, where $\mathbf{1}_{X_i}$ denotes the point mass at $X_i$. Then for functions $f: \mathcal{X} \to \mathbb{R}$ (or more generally, any function $f$ defined on $\mathcal{X}$), we let
$$P_n f := \mathbb{E}_{P_n}[f(X)] = \frac{1}{n}\sum_{i=1}^n f(X_i)$$
denote the empirical expectation of $f$ evaluated on the sample, and we also let
$$Pf := \mathbb{E}_P[f(X)] = \int f(x)\,dP(x)$$
denote general expectations under a measure $P$. With this notation, we study uniform laws of large numbers, which consist of proving results of the form
$$\sup_{f\in\mathcal{F}}|P_n f - Pf| \to 0, \qquad (4.3.1)$$
where convergence is in probability, expectation, almost surely, or with rates of convergence. When we view $P_n$ and $P$ as (infinite-dimensional) vectors on the space of maps from $\mathcal{F} \to \mathbb{R}$, then we may define the (semi)norm $\|\cdot\|_{\mathcal{F}}$ for any $L: \mathcal{F} \to \mathbb{R}$ by
$$\|L\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}}|L(f)|,$$
in which case Eq. (4.3.1) is equivalent to proving
$$\|P_n - P\|_{\mathcal{F}} \to 0.$$
Thus, roughly, we are simply asking questions about when random vectors converge to their expectations.¹
The starting point of this investigation considers bounded random functions, that is, $\mathcal{F}$ consists of functions $f: \mathcal{X} \to [a, b]$ for some $-\infty < a \le b < \infty$. In this case, the bounded differences inequality (Proposition 4.2.5) immediately implies that bounds on the expectation of $\|P_n - P\|_{\mathcal{F}}$ provide strong guarantees on concentration of $\|P_n - P\|_{\mathcal{F}}$.

¹Some readers may worry about measurability issues here. All of our applications will be in separable spaces, so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this from now on.


Proposition 4.3.1. Let $\mathcal{F}$ be as above. Then
$$\mathbb{P}(\|P_n - P\|_{\mathcal{F}} \ge \mathbb{E}[\|P_n - P\|_{\mathcal{F}}] + t) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right) \quad \text{for } t \ge 0.$$
Proof Let $P_n$ and $P_n'$ be two empirical distributions, differing only in observation $i$ (with $X_i$ and $X_i'$). We observe that
$$\sup_{f\in\mathcal{F}}|P_n f - Pf| - \sup_{f\in\mathcal{F}}|P_n' f - Pf| \le \sup_{f\in\mathcal{F}}\left\{|P_n f - Pf| - |P_n' f - Pf|\right\} \le \frac{1}{n}\sup_{f\in\mathcal{F}}|f(X_i) - f(X_i')| \le \frac{b-a}{n}$$
by the triangle inequality. An entirely parallel argument gives the converse lower bound of $-\frac{b-a}{n}$, and thus Proposition 4.2.5 gives the result.
Proposition 4.3.1 shows that, to provide control over high-probability concentration of $\|P_n - P\|_{\mathcal{F}}$, it is (at least in cases where $\mathcal{F}$ is bounded) sufficient to control the expectation $\mathbb{E}[\|P_n - P\|_{\mathcal{F}}]$. We take this approach through the remainder of this section, developing tools to simplify bounding this quantity.

Our starting points consist of a few inequalities relating expectations to symmetrized quantities, which are frequently easier to control than their non-symmetrized parts. This symmetrization technique is widely used in probability theory, theoretical statistics, and machine learning. The key is that for centered random variables, symmetrized quantities have, to within numerical constants, similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent to analyze the symmetrized quantity and the initial quantity.
Proposition 4.3.2. Let $X_i$ be independent random vectors on a (Banach) space with norm $\|\cdot\|$ and let $\varepsilon_i \in \{-1, 1\}$ be independent random signs. Then for any $p \ge 1$,
$$2^{-p}\,\mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i(X_i - \mathbb{E}[X_i])\right\|^p\right] \le \mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - \mathbb{E}[X_i])\right\|^p\right] \le 2^p\,\mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|^p\right].$$
In the proof of the upper bound, we could also show the bound
$$\mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - \mathbb{E}[X_i])\right\|^p\right] \le 2^p\,\mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i(X_i - \mathbb{E}[X_i])\right\|^p\right],$$
so we may analyze whichever is more convenient.


Proof We prove the right bound first. We introduce independent copies of the $X_i$ and use these to symmetrize the quantity. Indeed, let $X_i'$ be an independent copy of $X_i$, and use Jensen's inequality and the convexity of $\|\cdot\|^p$ to observe that
$$\mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - \mathbb{E}[X_i'])\right\|^p\right] \le \mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - X_i')\right\|^p\right].$$
Now, note that the distribution of $X_i - X_i'$ is symmetric, so that $X_i - X_i' \stackrel{\mathrm{dist}}{=} \varepsilon_i(X_i - X_i')$, and thus
$$\mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - \mathbb{E}[X_i])\right\|^p\right] \le \mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i(X_i - X_i')\right\|^p\right].$$
Multiplying and dividing by $2^p$, Jensen's inequality then gives
$$\mathbb{E}\left[\left\|\sum_{i=1}^n (X_i - \mathbb{E}[X_i])\right\|^p\right] \le 2^p\,\mathbb{E}\left[\left\|\frac{1}{2}\sum_{i=1}^n \varepsilon_i(X_i - X_i')\right\|^p\right] \le 2^{p-1}\left(\mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i X_i\right\|^p\right] + \mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i X_i'\right\|^p\right]\right)$$
as desired.

For the left bound in the proposition, let $Y_i = X_i - \mathbb{E}[X_i]$ be the centered version of the random variables. We break the sum over random variables into two parts, conditional on whether $\varepsilon_i = \pm 1$, using repeated conditioning. We have
$$\mathbb{E}\left[\left\|\sum_{i=1}^n \varepsilon_i Y_i\right\|^p\right] = \mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=1} Y_i - \sum_{i:\varepsilon_i=-1} Y_i\right\|^p\right] \le 2^{p-1}\,\mathbb{E}\left[\mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=1} Y_i\right\|^p \,\Big|\, \varepsilon\right] + \mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=-1} Y_i\right\|^p \,\Big|\, \varepsilon\right]\right]$$
$$= 2^{p-1}\,\mathbb{E}\left[\mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=1} Y_i + \sum_{i:\varepsilon_i=-1}\mathbb{E}[Y_i]\right\|^p \,\Big|\, \varepsilon\right] + \mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=-1} Y_i + \sum_{i:\varepsilon_i=1}\mathbb{E}[Y_i]\right\|^p \,\Big|\, \varepsilon\right]\right]$$
$$\le 2^{p-1}\,\mathbb{E}\left[\mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=1} Y_i + \sum_{i:\varepsilon_i=-1} Y_i\right\|^p \,\Big|\, \varepsilon\right] + \mathbb{E}\left[\left\|\sum_{i:\varepsilon_i=-1} Y_i + \sum_{i:\varepsilon_i=1} Y_i\right\|^p \,\Big|\, \varepsilon\right]\right] = 2^p\,\mathbb{E}\left[\left\|\sum_{i=1}^n Y_i\right\|^p\right],$$
where the middle equality uses $\mathbb{E}[Y_i] = 0$ and the second inequality is Jensen's, conditionally on $\varepsilon$.

We obtain as an immediate corollary a symmetrization bound for supremum norms on function spaces. In this corollary, we use the symmetrized empirical measure
$$P_n^0 := \frac{1}{n}\sum_{i=1}^n \varepsilon_i\mathbf{1}_{X_i}, \qquad P_n^0 f = \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i).$$
The expectation of $\|P_n^0\|_{\mathcal{F}}$ is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and we have the following corollary.

Corollary 4.3.3. Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$ and $X_i$ be i.i.d. Then $\mathbb{E}[\|P_n - P\|_{\mathcal{F}}] \le 2\mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$.

From Corollary 4.3.3, it is evident that by controlling the expectation of the symmetrized process $\mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$ we can derive concentration inequalities and uniform laws of large numbers. For example, we immediately obtain that
$$\mathbb{P}\left(\|P_n - P\|_{\mathcal{F}} \ge 2\mathbb{E}[\|P_n^0\|_{\mathcal{F}}] + t\right) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right)$$


for all t ≥ 0 whenever F consists of functions f : X → [a, b].


There are numerous examples of uniform laws of large numbers, many of which reduce to
developing bounds on the expectation E[kPn0 kF ], which is frequently possible via more advanced
techniques we develop in Chapter 6. A frequent application of these symmetrization ideas is to
risk minimization problems, as we discuss in the coming section; for these, it will be useful for us
to develop a few analytic and calculus tools. To better match the development of these ideas, we
return to the notation of Rademacher complexities, so that Rn (F) := E[ Pn0 F ]. The first is a
standard result, which we state for its historical value and the simplicity of its proof.
Proposition 4.3.4 (Massart’s finite class bound). Let F be any collection of functions with f :
X → R, and assume that σn2 := n−1 E[maxf ∈F ni=1 f (Xi )2 ] < ∞. Then
P
p
2σn2 log |F|
Rn (F) ≤ √ .
n
Proof For each fixed xn1 , the random variable ni=1 εi f (xi ) is ni=1 f (xi )2 -sub-Gaussian. Now,
P P
σ 2 (xn1 ) := n−1 maxf ∈F ni=1 f (xi )2 . Using the results of Exercise 4.7, that is, that E[maxj≤n Zj ] ≤
P
define
p
2σ 2 log n if the Zj are each σ 2 -sub-Gaussian, we see that
p
n 2σ 2 (xn1 ) log |F|
Rn (F | x1 ) ≤ √ .
n
√ p
Jensen’s inequality that E[ ·] ≤ E[·] gives the result.

A refinement of Massart’s finite class bound applies when the classes are infinite but, on a
collection X1 , . . . , Xn , the functions f ∈ F may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points x1 , . . . , xn by SF (xn1 ) :=
card{(f (x1 ), . . . , f (xn )) | f ∈ F }, the number of distinct vectors of values (f (x1 ), . . . , f (xn )) the
functions f ∈ F may take. The shatter coefficient is the maximum of the empirical shatter coeffi-
cients over xn1 ∈ X n , that is, SF (n) := supxn1 SF (xn1 ). It is clear that SF (n) ≤ |F| always, but by
only counting distinct values, we have the following corollary.
Corollary 4.3.5 (A sharper variant of Massart’s finite class bound). Let F be any collection of
functions with f : X → R, and assume that σn2 := n−1 E[maxf ∈F ni=1 f (Xi )2 ] < ∞. Then
P
p
2σn2 log SF (n)
Rn (F) ≤ √ .
n
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher sums, which allow us to consider compositions of function classes while maintaining small complexity measures. We state the rule here; the proof is complex, so we defer it to Section 4.5.3.

Theorem 4.3.6 (Ledoux-Talagrand contraction). Let $T \subset \mathbb{R}^n$ be an arbitrary set and let $\phi_i: \mathbb{R} \to \mathbb{R}$ be 1-Lipschitz and satisfy $\phi_i(0) = 0$. Then for any nondecreasing convex function $\Phi: \mathbb{R} \to \mathbb{R}_+$,
$$\mathbb{E}\left[\Phi\left(\frac{1}{2}\sup_{t\in T}\sum_{i=1}^n \phi_i(t_i)\varepsilon_i\right)\right] \le \mathbb{E}\left[\Phi\left(\sup_{t\in T}\langle t, \varepsilon\rangle\right)\right].$$


A corollary to this theorem is suggestive of its power and applicability. Let $\phi: \mathbb{R} \to \mathbb{R}$ be $L$-Lipschitz, and for a function class $\mathcal{F}$ define $\phi\circ\mathcal{F} = \{\phi\circ f \mid f \in \mathcal{F}\}$. Then we have the following corollary about Rademacher complexities of contractive mappings.

Corollary 4.3.7. Let $\mathcal{F}$ be an arbitrary function class and $\phi$ be $L$-Lipschitz. Then
$$R_n(\phi\circ\mathcal{F}) \le 2LR_n(\mathcal{F}) + \frac{|\phi(0)|}{\sqrt{n}}.$$
Proof The result is an almost immediate consequence of Theorem 4.3.6; we simply recenter our functions. Indeed, we have
$$R_n(\phi\circ\mathcal{F} \mid x_1^n) = \mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varepsilon_i(\phi(f(x_i)) - \phi(0)) + \frac{1}{n}\sum_{i=1}^n \varepsilon_i\phi(0)\right]$$
$$\le \mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n \varepsilon_i(\phi(f(x_i)) - \phi(0))\right] + \mathbb{E}\left[\left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\right|\right]|\phi(0)| \le 2LR_n(\mathcal{F}) + \frac{|\phi(0)|}{\sqrt{n}},$$
where the final inequality follows by Theorem 4.3.6 (as $g(\cdot) = \phi(\cdot) - \phi(0)$ is Lipschitz and satisfies $g(0) = 0$) and the fact that $\mathbb{E}[|\sum_{i=1}^n \varepsilon_i|] \le \sqrt{n}$.

4.3.2 Metric entropy, coverings, and packings


When the class of functions F under consideration is finite, the union bound more or less provides
guarantees that Pn f is uniformly close to P f for all f ∈ F. When F is infinite, however, we require
a different set of tools for addressing uniform laws. In many cases, because of the application
of the bounded differences inequality in Proposition 4.3.1, all we really need to do is to control
the expectation E[kPn0 kF ], though the techniques we develop here will have broader use and can
sometimes directly guarantee concentration.
The basic object we wish to control is a measure of the size of the space on which we work.
To that end, we modify notation a bit to simply consider arbitrary vectors θ ∈ Θ, where Θ is a
non-empty set with an associated (semi)metric ρ. For many purposes in estimation (and in our
optimality results in the further parts of the book), a natural way to measure the size of the set is
via the number of balls of a fixed radius δ > 0 required to cover it.
Definition 4.7 (Covering number). Let Θ be a set with (semi)metric ρ. A δ-cover of the set Θ with
respect to ρ is a set {θ1 , . . . , θN } such that for any point θ ∈ Θ, there exists some v ∈ {1, . . . , N }
such that ρ(θ, θv ) ≤ δ. The δ-covering number of Θ is
N (δ, Θ, ρ) := inf {N ∈ N : there exists a δ-cover θ1 , . . . , θN of Θ} .
The metric entropy of the set Θ is simply the logarithm of its covering number log N (δ, Θ, ρ).
We can define a related measure—more useful for constructing our lower bounds—of size that
relates to the number of disjoint balls of radius δ > 0 that can be placed into the set Θ.
Definition 4.8 (Packing number). A δ-packing of the set Θ with respect to ρ is a set {θ1 , . . . , θM }
such that for all distinct v, v 0 ∈ {1, . . . , M }, we have ρ(θv , θv0 ) ≥ δ. The δ-packing number of Θ is
M (δ, Θ, ρ) := sup {M ∈ N : there exists a δ-packing θ1 , . . . , θM of Θ} .

[Figure 4.1: A δ-covering of the elliptical set by balls of radius δ.]

[Figure 4.2: A δ-packing of the elliptical set, where balls have radius δ/2. No balls overlap, and each center of the packing satisfies $\|\theta_v - \theta_{v'}\| \ge \delta$.]

Figures 4.1 and 4.2 give examples of (respectively) a covering and a packing of the same set. An exercise in proof by contradiction shows that the packing and covering numbers of a set are in fact closely related:

Lemma 4.3.8. The packing and covering numbers satisfy the following inequalities:
$$M(2\delta, \Theta, \rho) \le N(\delta, \Theta, \rho) \le M(\delta, \Theta, \rho).$$
We leave derivation of this lemma to Exercise 4.11, noting that it shows that (up to constant factors) packing and covering numbers have the same scaling in the radius $\delta$. As a simple example, we see for any interval $[a, b]$ on the real line that in the usual absolute distance metric, $N(\delta, [a, b], |\cdot|) \asymp (b - a)/\delta$.
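The relation in Lemma 4.3.8 is constructive: any maximal $\delta$-packing is automatically a $\delta$-cover. A greedy sketch over a finite point cloud (assuming numpy; the function name is ours) illustrates this, along with the volume scaling of Lemma 4.3.10 below:

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedily select a delta-separated subset of the input points; by
    maximality the result is also a delta-cover (cf. Lemma 4.3.8)."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) >= delta for c in centers):
            centers.append(p)
    return np.array(centers)

rng = np.random.default_rng(5)
pts = rng.uniform(-1, 1, size=(5000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1]  # samples in the unit l2-ball
for delta in [0.5, 0.25, 0.125]:
    C = greedy_packing(pts, delta)
    print(f"delta = {delta}: {len(C)} centers, "
          f"volume bound (1 + 2/delta)^2 = {(1 + 2 / delta) ** 2:.0f}")
```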
As one example of the metric entropy, consider a set of functions $\mathcal{F}$ with reasonable covering numbers (metric entropy) in $\|\cdot\|_\infty$-norm.

Example 4.3.9 (The "standard" covering number guarantee): Let $\mathcal{F}$ consist of functions $f: \mathcal{X} \to [-b, b]$ and let the metric $\rho$ be $\|f - g\|_\infty = \sup_{x\in\mathcal{X}}|f(x) - g(x)|$. Then
$$\mathbb{P}\left(\sup_{f\in\mathcal{F}}|P_n f - Pf| \ge t\right) \le \exp\left(-\frac{nt^2}{18b^2} + \log N(t/3, \mathcal{F}, \|\cdot\|_\infty)\right). \qquad (4.3.2)$$
So as long as the covering numbers $N(t, \mathcal{F}, \|\cdot\|_\infty)$ grow slowly enough—so that $\log N(t) \ll nt^2$—we have the (essentially) sub-Gaussian tail bound (4.3.2). Example 4.4.11 gives one typical case. Indeed, fix a minimal $t/3$-cover of $\mathcal{F}$ in $\|\cdot\|_\infty$ of size $N := N(t/3, \mathcal{F}, \|\cdot\|_\infty)$, calling the covering functions $f_1, \ldots, f_N$. Then for any $f \in \mathcal{F}$ and the function $f_i$ satisfying $\|f - f_i\|_\infty \le t/3$, we have
$$|P_n f - Pf| \le |P_n f - P_n f_i| + |P_n f_i - Pf_i| + |Pf_i - Pf| \le |P_n f_i - Pf_i| + \frac{2t}{3}.$$
The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that
$$\mathbb{P}\left(\max_{i\le N}|P_n f_i - Pf_i| \ge t\right) \le \exp\left(-\frac{nt^2}{2b^2} + \log N\right).$$
Combine this bound (replacing $t$ with $t/3$) to obtain inequality (4.3.2). 3
Given the relationships between packing, covering, and size of sets $\Theta$, we would expect there to be relationships between volume, packing, and covering numbers. This is indeed the case, as we now demonstrate for arbitrary norm balls in finite dimensions.

Lemma 4.3.10. Let $B$ denote the unit $\|\cdot\|$-ball in $\mathbb{R}^d$. Then
$$\left(\frac{1}{\delta}\right)^d \le N(\delta, B, \|\cdot\|) \le \left(1 + \frac{2}{\delta}\right)^d.$$
Proof We prove the lemma via a volumetric argument. For the lower bound, note that if the points $v_1, \ldots, v_N$ are a $\delta$-cover of $B$, then
$$\mathrm{Vol}(B) \le \sum_{i=1}^N \mathrm{Vol}(\delta B + v_i) = N\,\mathrm{Vol}(\delta B) = N\,\mathrm{Vol}(B)\,\delta^d.$$
In particular, $N \ge \delta^{-d}$. For the upper bound on $N(\delta, B, \|\cdot\|)$, let $\mathcal{V}$ be a $\delta$-packing of $B$ with maximal cardinality, so that $|\mathcal{V}| = M(\delta, B, \|\cdot\|) \ge N(\delta, B, \|\cdot\|)$ (recall Lemma 4.3.8). Notably, the collection of $\delta$-balls $\{\delta B + v_i\}_{i=1}^M$ covers the ball $B$ (as otherwise, we could put an additional element in the packing $\mathcal{V}$), and moreover, the balls $\{\frac{\delta}{2}B + v_i\}$ are all disjoint by definition of a packing. Consequently, we find that
$$M\left(\frac{\delta}{2}\right)^d\mathrm{Vol}(B) = M\,\mathrm{Vol}\left(\frac{\delta}{2}B\right) \le \mathrm{Vol}\left(B + \frac{\delta}{2}B\right) = \left(1 + \frac{\delta}{2}\right)^d\mathrm{Vol}(B).$$
Rewriting, we obtain
$$M(\delta, B, \|\cdot\|) \le \left(1 + \frac{\delta}{2}\right)^d\left(\frac{2}{\delta}\right)^d = \left(1 + \frac{2}{\delta}\right)^d,$$
completing the proof.

Let us give one application of Lemma 4.3.10 to concentration of random matrices; we explore more in the exercises as well. We can generalize the definition of sub-Gaussian random variables to sub-Gaussian random vectors, where we say that $X \in \mathbb{R}^d$ is a $\sigma^2$-sub-Gaussian vector if
$$\mathbb{E}[\exp(\langle u, X - \mathbb{E}[X]\rangle)] \le \exp\left(\frac{\sigma^2}{2}\|u\|_2^2\right) \qquad (4.3.3)$$
for all $u \in \mathbb{R}^d$. For example, $X \sim \mathsf{N}(0, I_d)$ is immediately 1-sub-Gaussian, and $X \in [-b, b]^d$ with independent entries is $b^2$-sub-Gaussian. Now, suppose that the $X_i$ are independent isotropic random vectors, meaning that $\mathbb{E}[X_i] = 0$ and $\mathbb{E}[X_i X_i^\top] = I_d$, and that they are also $\sigma^2$-sub-Gaussian. Then by an application of Lemma 4.3.10, we can give concentration guarantees for the sample covariance $\Sigma_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ in the operator norm $\|A\|_{\mathrm{op}} := \sup\{\langle u, Av\rangle \mid \|u\|_2 = \|v\|_2 = 1\}$.
Proposition 4.3.11. Let $X_i$ be independent isotropic and $\sigma^2$-sub-Gaussian vectors. Then there is a numerical constant $C$ such that the sample covariance $\Sigma_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ satisfies
$$\|\Sigma_n - I_d\|_{\mathrm{op}} \le C\sigma^2\left(\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{d + \log\frac{1}{\delta}}{n}\right)$$
with probability at least $1 - \delta$.


Proof We begin with an intermediate lemma.

Lemma 4.3.12. Let $A$ be symmetric and $\{u_i\}_{i=1}^N$ be an $\epsilon$-cover of the unit $\ell_2$-ball $B_2^d$. Then
$$(1 - 2\epsilon)\|A\|_{\mathrm{op}} \le \max_{i\le N}\langle u_i, Au_i\rangle \le \|A\|_{\mathrm{op}}.$$
Proof The second inequality is trivial. Fix any $u \in B_2^d$. Then for the $i$ such that $\|u - u_i\|_2 \le \epsilon$, we have
$$\langle u, Au\rangle = \langle u - u_i, Au\rangle + \langle u - u_i, Au_i\rangle + \langle u_i, Au_i\rangle \le 2\epsilon\|A\|_{\mathrm{op}} + \langle u_i, Au_i\rangle$$
by definition of the operator norm. Taking a supremum over $u$ gives the final result.

Let the matrix $E_i = X_i X_i^\top - I$, and define the average error $\overline{E}_n = \frac{1}{n}\sum_{i=1}^n E_i$. Then with this lemma in hand, we see that for any $\epsilon$-cover $\mathcal{N}$ of the $\ell_2$-ball $B_2^d$,
$$(1 - 2\epsilon)\|\overline{E}_n\|_{\mathrm{op}} \le \max_{u\in\mathcal{N}}\langle u, \overline{E}_n u\rangle.$$
Now, note that $\langle u, E_i u\rangle = \langle u, X_i\rangle^2 - \|u\|_2^2$ is sub-exponential, as it is certainly mean zero and, moreover, is the square of a sub-Gaussian; in particular, Theorem 4.1.15 shows that there is a numerical constant $C < \infty$ such that
$$\mathbb{E}[\exp(\lambda\langle u, E_i u\rangle)] \le \exp\left(C\lambda^2\sigma^4\right) \quad \text{for } |\lambda| \le \frac{1}{C\sigma^2}.$$
Taking $\epsilon = \frac{1}{4}$ in our covering $\mathcal{N}$, then,
$$\mathbb{P}(\|\overline{E}_n\|_{\mathrm{op}} \ge t) \le \mathbb{P}\left(\max_{u\in\mathcal{N}}\langle u, \overline{E}_n u\rangle \ge t/2\right) \le |\mathcal{N}|\cdot\max_{u\in\mathcal{N}}\mathbb{P}\left(\langle u, n\overline{E}_n u\rangle \ge nt/2\right)$$
by a union bound. As sums of sub-exponential random variables remain sub-exponential, Corollary 4.1.18 implies
$$\mathbb{P}\left(\|\overline{E}_n\|_{\mathrm{op}} \ge t\right) \le |\mathcal{N}|\exp\left(-c\min\left\{\frac{nt^2}{\sigma^4}, \frac{nt}{\sigma^2}\right\}\right),$$
where $c > 0$ is a numerical constant. Finally, we apply Lemma 4.3.10, which guarantees that $|\mathcal{N}| \le 9^d$, and then take $t$ to scale as the maximum of $\sigma^2\sqrt{\frac{d + \log\frac{1}{\delta}}{n}}$ and $\sigma^2\frac{d + \log\frac{1}{\delta}}{n}$.
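A quick simulation shows the $\sqrt{d/n}$ scaling of Proposition 4.3.11 in action (a minimal sketch, assuming numpy, with Gaussian vectors so that $\sigma^2 = 1$):

```python
import numpy as np

# Operator-norm error of the sample covariance for isotropic Gaussian
# vectors, compared against the sqrt(d/n) rate of Proposition 4.3.11.
rng = np.random.default_rng(6)
d = 50
for n in [100, 400, 1600, 6400]:
    X = rng.normal(size=(n, d))                   # isotropic, 1-sub-Gaussian
    Sigma_n = X.T @ X / n
    err = np.linalg.norm(Sigma_n - np.eye(d), 2)  # spectral (operator) norm
    print(f"n = {n:5d}: ||Sigma_n - I||_op = {err:.3f}, "
          f"sqrt(d/n) = {np.sqrt(d / n):.3f}")
```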


4.4 Generalization bounds


We now build off of our ideas on uniform laws of large numbers and Rademacher complexities to demonstrate their applications in statistical machine learning problems, focusing on empirical risk minimization procedures and related problems. We consider a setting as follows: we have a sample $Z_1, \ldots, Z_n \in \mathcal{Z}$ drawn i.i.d. according to some (unknown) distribution $P$, and we have a collection of functions $\mathcal{F}$ from which we wish to select an $f$ that "fits" the data well, according to some loss measure $\ell: \mathcal{F}\times\mathcal{Z} \to \mathbb{R}$. That is, we wish to find a function $f \in \mathcal{F}$ minimizing the risk
$$L(f) := \mathbb{E}_P[\ell(f, Z)]. \qquad (4.4.1)$$
In general, however, we only have access to the risk via the empirical distribution of the $Z_i$, and we often choose $f$ by minimizing the empirical risk
$$\widehat{L}_n(f) := \frac{1}{n}\sum_{i=1}^n \ell(f, Z_i). \qquad (4.4.2)$$
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat more concrete.
Example 4.4.1 (Binary classification problems): One standard problem—still abstract—that motivates the formulation (4.4.1) is the binary classification problem. Here the data $Z_i$ come in pairs $(X, Y)$, where $X \in \mathcal{X}$ is some set of covariates (independent variables) and $Y \in \{-1, 1\}$ is the label of example $X$. The function class $\mathcal{F}$ consists of functions $f: \mathcal{X} \to \mathbb{R}$, and the goal is to find a function $f$ such that
$$\mathbb{P}(\mathrm{sign}(f(X)) \ne Y)$$
is small, that is, minimizing the risk $\mathbb{E}[\ell(f, Z)]$ where the loss is the zero-one loss, $\ell(f, (x, y)) = \mathbf{1}\{f(x)y \le 0\}$. 3

Example 4.4.2 (Multiclass classification): The multiclass classification problem is identical to the binary problem, but instead of $Y \in \{-1, 1\}$ we assume that $Y \in [k] = \{1, \ldots, k\}$ for some $k \ge 2$, and the function class $\mathcal{F}$ consists of (a subset of) functions $f: \mathcal{X} \to \mathbb{R}^k$. The goal is to find a function $f$ such that, if $Y = y$ is the correct label for a datapoint $x$, then $f_y(x) > f_l(x)$ for all $l \ne y$. That is, we wish to find $f \in \mathcal{F}$ minimizing
$$\mathbb{P}(\exists~l \ne Y \text{ such that } f_l(X) \ge f_Y(X)).$$
In this case, the loss function is the zero-one loss $\ell(f, (x, y)) = \mathbf{1}\{\max_{l\ne y} f_l(x) \ge f_y(x)\}$. 3

Example 4.4.3 (Binary classification with linear functions): In the standard statistical learning setting, the data $x$ belong to $\mathbb{R}^d$, and we assume that our function class $\mathcal{F}$ is indexed by a set $\Theta \subset \mathbb{R}^d$, so that $\mathcal{F} = \{f_\theta : f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. In this case, we may use the zero-one loss $\ell_{\mathrm{zo}}(f_\theta, (x, y)) := \mathbf{1}\{y\theta^\top x \le 0\}$, or the convex hinge and logistic losses
$$\ell_{\mathrm{hinge}}(f_\theta, (x, y)) = \left[1 - yx^\top\theta\right]_+ \quad \text{and} \quad \ell_{\mathrm{logit}}(f_\theta, (x, y)) = \log(1 + \exp(-yx^\top\theta)).$$
The hinge and logistic losses, as they are convex, are substantially computationally easier to work with, and they are common choices in applications. 3


The main motivating question that we ask is the following: given a sample $Z_1, \ldots, Z_n$, if we choose some $\widehat{f}_n \in \mathcal{F}$ based on this sample, can we guarantee that it generalizes to unseen data? In particular, can we guarantee that (with high probability) we have the risk bound
$$L(\widehat{f}_n) \le \widehat{L}_n(\widehat{f}_n) + \epsilon = \frac{1}{n}\sum_{i=1}^n \ell(\widehat{f}_n, Z_i) + \epsilon \qquad (4.4.3)$$
for some small $\epsilon$? If we allow $\widehat{f}_n$ to be arbitrary, then this becomes clearly impossible: consider the classification example 4.4.1, and set $\widehat{f}_n$ to be the "hash" function that sets $\widehat{f}_n(x) = y$ if the pair $(x, y)$ was in the sample, and otherwise $\widehat{f}_n(x) = -1$. Then clearly $\widehat{L}_n(\widehat{f}_n) = 0$, while there is no useful bound on $L(\widehat{f}_n)$.

4.4.1 Finite and countable classes of functions


In order to get bounds of the form (4.4.3), we require a few assumptions that are not too onerous. First, throughout this section, we will assume that for any fixed function $f$, the loss $\ell(f, Z)$ is $\sigma^2$-sub-Gaussian, that is,
$$\mathbb{E}_P[\exp(\lambda(\ell(f, Z) - L(f)))] \le \exp\left(\frac{\lambda^2\sigma^2}{2}\right) \qquad (4.4.4)$$
for all $f \in \mathcal{F}$. (Recall that the risk functional $L(f) = \mathbb{E}_P[\ell(f, Z)]$.) For example, if the loss is the zero-one loss from classification problems, inequality (4.4.4) is satisfied with $\sigma^2 = \frac{1}{4}$ by Hoeffding's lemma. In order to guarantee a bound of the form (4.4.3) for a function $\widehat{f}$ chosen dependent on the data, in this section we give uniform bounds, that is, we would like to bound
$$\mathbb{P}\left(\text{there exists } f \in \mathcal{F} \text{ s.t. } L(f) > \widehat{L}_n(f) + t\right) \quad \text{or} \quad \mathbb{P}\left(\sup_{f\in\mathcal{F}}\left(L(f) - \widehat{L}_n(f)\right) > t\right).$$
Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy
for the true risk L, even when fbn is chosen based on the data.
Now, recalling that our set of functions or predictors $\mathcal{F}$ is finite or countable, let us suppose that for each $f \in \mathcal{F}$ we have a complexity measure $c(f)$—a penalty—such that
$$\sum_{f\in\mathcal{F}} e^{-c(f)} \le 1. \qquad (4.4.5)$$
This inequality should look similar to the Kraft inequality—which we will see in the coming chapters—from coding theory. As soon as we have such a penalty function, however, we have the following result.

Theorem 4.4.4. Let the loss $\ell$, distribution $P$ on $\mathcal{Z}$, and function class $\mathcal{F}$ be such that $\ell(f, Z)$ is $\sigma^2$-sub-Gaussian for each $f \in \mathcal{F}$, and assume that the complexity inequality (4.4.5) holds. Then with probability at least $1 - \delta$ over the sample $Z_{1:n}$,
$$L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2\sigma^2(\log\frac{1}{\delta} + c(f))}{n}} \quad \text{for all } f \in \mathcal{F}.$$


Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10) we have for any $t \ge 0$ and any $f \in \mathcal{F}$ that
$$\mathbb{P}\left(L(f) \ge \widehat{L}_n(f) + t\right) \le \exp\left(-\frac{nt^2}{2\sigma^2}\right).$$
Now, if we replace $t$ by $\sqrt{t^2 + 2\sigma^2 c(f)/n}$, we obtain
$$\mathbb{P}\left(L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2\sigma^2 c(f)/n}\right) \le \exp\left(-\frac{nt^2}{2\sigma^2} - c(f)\right).$$
Then using a union bound, we have
$$\mathbb{P}\left(\exists~f \in \mathcal{F} \text{ s.t. } L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2\sigma^2 c(f)/n}\right) \le \sum_{f\in\mathcal{F}}\exp\left(-\frac{nt^2}{2\sigma^2} - c(f)\right) = \exp\left(-\frac{nt^2}{2\sigma^2}\right)\underbrace{\sum_{f\in\mathcal{F}}\exp(-c(f))}_{\le 1}.$$
Setting $t^2 = 2\sigma^2\log\frac{1}{\delta}/n$ gives the result.

As one classical example of this setting, suppose that we have a finite class of functions $\mathcal{F}$. Then we can set $c(f) = \log|\mathcal{F}|$, in which case we clearly have the summation guarantee (4.4.5), and we obtain
$$L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2\sigma^2(\log\frac{1}{\delta} + \log|\mathcal{F}|)}{n}} \quad \text{uniformly for } f \in \mathcal{F}$$
with probability at least $1 - \delta$. To make this even more concrete, consider the following example.
Example 4.4.5 (Floating point classifiers): We implement a linear binary classifier using double-precision floating point values, that is, we have $f_\theta(x) = \theta^\top x$ for all $\theta \in \mathbb{R}^d$ that may be represented using $d$ double-precision floating point numbers. Then for each coordinate of $\theta$, there are at most $2^{64}$ representable numbers; in total, we must thus have $|\mathcal{F}| \le 2^{64d}$. Thus, for the zero-one loss $\ell_{\mathrm{zo}}(f_\theta, (x, y)) = \mathbf{1}\{\theta^\top xy \le 0\}$, we have
$$L(f_\theta) \le \widehat{L}_n(f_\theta) + \sqrt{\frac{\log\frac{1}{\delta} + 45d}{2n}}$$
for all representable classifiers simultaneously, with probability at least $1 - \delta$, as the zero-one loss is $\frac{1}{4}$-sub-Gaussian. (Here we have used that $64\log 2 < 45$.) 3
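Plugging in numbers shows the scale of the guarantee; a quick sketch (illustrative values, assuming numpy) evaluating the bound of Example 4.4.5:

```python
import numpy as np

# Evaluate the uniform generalization gap sqrt((log(1/delta) + 45 d) / (2n))
# from Example 4.4.5 for a d-dimensional linear classifier.
d, delta = 100, 0.01
for n in [10_000, 100_000, 1_000_000]:
    gap = np.sqrt((np.log(1 / delta) + 45 * d) / (2 * n))
    print(f"n = {n:>9,}: zero-one risk gap <= {gap:.4f}")
```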
We also note in passing that by replacing $\delta$ with $\delta/2$ in the bounds of Theorem 4.4.4, a union bound yields the following two-sided corollary.

Corollary 4.4.6. Under the conditions of Theorem 4.4.4, we have
$$\left|\widehat{L}_n(f) - L(f)\right| \le \sqrt{\frac{2\sigma^2(\log\frac{2}{\delta} + c(f))}{n}} \quad \text{for all } f \in \mathcal{F}$$
with probability at least $1 - \delta$.


4.4.2 Large classes


When the collection of functions is (uncountably) infinite, it can be more challenging to obtain
strong generalization bounds, though there still exist numerous tools for these ideas. The most
basic, of which we will give examples, leverage covering number bounds (essentially, as in Exam-
ple 4.3.9). We return in the next chapter to alternative approaches based on randomization and
divergence measures, which provide guarantees with somewhat similar structure to those we present
here.
Let us begin by considering a few examples, after which we provide examples showing how to
derive explicit bounds using Rademacher complexities.

Example 4.4.7 (Rademacher complexity of the `2 -ball): Let Θ = {θ ∈ Rd | kθk2 ≤ r}, and
consider the class of linear functionals F := {fθ (x) = θT x, θ ∈ Θ}. Then
v
u n
ru X
n
Rn (F | x1 ) ≤ t kxi k22 ,
n
i=1

because we have
v " v
n n u n
" # #
u 2
r X ru X ru X
Rn (F | xn1 ) = E εi x i ≤ t E εi x i = t kxi k22 ,
n 2 n 2 n
i=1 i=1 i=1

as desired. 3

In high-dimensional situations, it is sometimes useful to consider more restrictive function


classes, for example, those indexed by vectors in an `1 -ball.

Example 4.4.8 (Rademacher complexity of the `1 -ball): In contrast to the previous example,
suppose that Θ = {θ ∈ Rd | kθk1 ≤ r}, and consider the linear class F := {fθ (x) = θT x, θ ∈ Θ}.
Then
" n #
r X
Rn (F | xn1 ) = E εi x i .
n ∞ i=1

Now, each coordinate j of ni=1 εi xi is ni=1 x2ij -sub-Gaussian, and thus using that E[maxj≤d Zj ] ≤
P P
p
2σ 2 log d for arbitrary σ 2 -sub-Gaussian Zj (see Exercise 4.7), we have
v
u n
n r u X
Rn (F | x1 ) ≤ t2 log(2d) max x2ij .
n j
i=1

To facilitate comparison with Example 4.4.8, suppose that the vectors


p xi all satisfy kxi k∞ ≤ b.
n √
√ Rn (F | x1 ) ≤ rb 2 log(2d)/ n. In contrast,
In this case, the preceding inequality implies that
the `2 -norm √ of such xi may satisfy kxi k2 = b d, so that the bounds of Example 4.4.7 scale

instead as rb d/ n, which can be exponentially larger. 3

These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in

93
Lexture Notes on Statistics and Information Theory John Duchi

which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L =
{(x, y) 7→ `(yf (x))}f ∈F be the induced collection of losses. As a typical example, we might have
`(t) = [1 − t]+ , `(t) = e−t , or `(t) = log(1 + e−t ). We have the following proposition.

Proposition 4.4.9. Let F and X be such that supx∈X |f (x)| ≤ M for f ∈ F and assume that
` is L-Lipschitz. Define the empirical and population risks L b n (f ) := Pn `(Y f (X)) and L(f ) :=
P `(Y f (X)). Then
!
2
 
P sup |L b n (f ) − L(f )| ≥ 4LRn (F) + t ≤ 2 exp − nt for t ≥ 0.
f ∈F 2L2 M 2

Proof We may recenter the class L, that is, replace `(·) with `(·) − `(0), without changing
b n (f ) − L(f ). Call this class L0 , so that kPn − P k = kPn − P k . This recentered class satisfies
L L L0
bounded differences with constant 2M L, as |`(yf (x)) − `(y 0 f (x0 ))| ≤ L|yf (x) − y 0 f (x0 )| ≤ 2LM ,
as in the proof of Proposition 4.3.1. Applying Proposition 4.3.1 and then Corollary 4.3.3 and gives
that P(supf ∈F |L b n (f ) − L(f )| ≥ 2Rn (L0 ) + t) ≤ exp(− nt22 2 ) for t ≥ 0. Then applying the con-
2M L
traction inequality (Theorem 4.3.6) yields Rn (L0 ) ≤ 2LRn (F), giving the result.

Let us give a few example applications of these ideas.

Example 4.4.10 (Support vector machines and hinge losses): In the support vector machine
problem, we receive data (Xi , Yi ) ∈ Rd × {±1}, and we seek to minimize average of the losses
`(θ; (x, y)) = 1 − yθT x + . We assume that the space X has kxk2 ≤ b for x ∈ X and that


Θ = {θ ∈ Rd | kθk2 ≤ r}. Applying Proposition 4.4.9 gives

nt2
   
P sup |Pn `(θ; (X, Y )) − P `(θ; (X, Y ))| ≥ 4Rn (FΘ ) + t ≤ exp − 2 2 ,
θ∈Θ 2r b

where FΘ = {fθ (x) = θT x}θ∈Θ . Now, we apply Example 4.4.7, which implies that

2rb
Rn (φ ◦ FΘ ) ≤ 2Rn (Fθ ) ≤ √ .
n

That is, we have

nt2
   
4rb
P sup |Pn `(θ; (X, Y )) − P `(θ; (X, Y ))| ≥ √ + t ≤ exp − ,
θ∈Θ n 2(rb)2

so that Pn and P become close at rate roughly rb/ n in this case. 3

Example 4.4.10 is what is sometimes called a “dimension free” convergence result—there is no


esxplicit dependence on the dimension d of the problem, except as the radii r and b make explicit.
One consequence of this is that if x and θ instead belong to a Hilbert space (potentiall infinite
dimensional) with inner product h·, ·i and norm kxk2 = hx, xi, but for which we are guaranteed
that kθk ≤ r and similarly kxk ≤ b, then the result still applies. Extending this to other function
classes is reasonably straightforward, and we present a few examples in the exercises.

94
Lexture Notes on Statistics and Information Theory John Duchi

When we do not have the simplifying structure of `(yf (x)) identified in the preceding examples,
we can still provide guarantees of generalization using the covering number guarantees introduced
in Section 4.3.2. The most common and important case is when we have a Lipschitzian loss function
in an underlying parameter θ.
Example 4.4.11 (Lipschitz functions over a norm-bounded parameter space): Consider the
parametric loss minimization problem

minimize L(θ) := E[`(θ; Z)]


θ∈Θ

for a loss function ` that is M -Lipschitz (with respect to the norm k·k) in its argument, where
for normalization we assume inf θ∈Θ `(θ, z) = 0 for each z. Then the metric entropy of Θ
bounds the metric entropy of the loss class F := {z 7→ `(θ, z)}θ∈Θ for the supremum norm
k·k∞ . Indeed, for any pair θ, θ0 , we have

sup |`(θ, z) − `(θ0 , z)| ≤ M θ − θ0 ,


z

and so an -cover of Θ is an M -cover of F in supremum norm. In particular,

N (, F, k·k∞ ) ≤ N (/M, Θ, k·k).

Assume that Θ ⊂ {θ | kθk ≤ b} for some finite b. Then Lemma 4.3.10 guarantees that
log N (, Θ, k·k) ≤ d log(1 + 2/) . d log 1 , and so the classical covering number argument in
Example 4.3.9 gives
nt2
   
M
P sup |Pn `(θ, Z) − P `(θ, Z)| ≥ t ≤ exp −c 2 2 + Cd log ,
θ∈Θ b M t
2 2d
where c, C are numerical constants. In particular, taking t2  M nb log nδ gives that

M b d log nδ
p
|Pn `(θ, Z) − P `(θ, Z)| . √
n
with probability at least 1 − δ. 3

4.4.3 Structural risk minimization and adaptivity


In general, for a given function class F, we can always decompose the excess risk into the approxi-
mation/estimation error decomposition. That is, let

L∗ = inf L(f ),
f

where the preceding infimum is taken across all (measurable) functions. Then we have

L(fbn ) − L∗ = L(fbn ) − inf L(f ) + inf L(f ) − L∗ . (4.4.6)


f ∈F f ∈F
| {z } | {z }
estimation approximation
There is often a tradeoff between these two, analogous to the bias/variance tradeoff in classical
statistics; if the approximation error is very small, then it is likely hard to guarantee that the esti-
mation error converges quickly to zero, while certainly a constant function will have low estimation

95
Lexture Notes on Statistics and Information Theory John Duchi

error, but may have substantial approximation error. With that in mind, we would like to develop
procedures that, rather than simply attaining good performance for the class F, are guaranteed
to trade-off in an appropriate way between the two types of error. This leads us to the idea of
structural risk minimization.
In this scenario, we assume we have a sequence of classes of functions, F1 , F2 , . . ., of increasing
complexity, meaning that F1 ⊂ F2 ⊂ . . .. For example, in a linear classification setting with
vectors x ∈ Rd , we might take a sequence of classes allowing increasing numbers of non-zeros in
the classification vector θ:
n o n o
F1 := fθ (x) = θ> x such that kθk0 ≤ 1 , F2 := fθ (x) = θ> x such that kθk0 ≤ 2 , . . . .

More broadly, let {Fk }k∈N be a (possibly infinite) increasing sequence of function classes. We
assume that for each Fk and each n ∈ N, there exists a constant Cn,k (δ) such that we have the
uniform generalization guarantee
!
P sup L b n (f ) − L(f ) ≥ Cn,k (δ) ≤ δ · 2−k .
f ∈Fk

For example, by Corollary 4.4.6, if F is finite we may take


s
log |Fk | + log 1δ + k log 2
Cn,k (δ) = 2σ 2 .
n

(We will see in subsequent sections of the course how to obtain other more general guarantees.)
We consider the following structural risk minimization procedure. First, given the empirical
risk L
b n , we find the model collection b
k minimizing the penalized risk
 
k := argmin inf Ln (f ) + Cn,k (δ) .
b b (4.4.7a)
k∈N f ∈Fk

We then choose fb to minimize the risk over the estimated “best” class Fbk , that is, set

fb := argmin L
b n (f ). (4.4.7b)
f ∈Fkb

With this procedure, we have the following theorem.

Theorem 4.4.12. Let fb be chosen according to the procedure (4.4.7a)–(4.4.7b). Then with proba-
bility at least 1 − δ, we have

L(fb) ≤ inf inf {L(f ) + 2Cn,k (δ)} .


k∈N f ∈Fk

Proof First, we have by the assumed guarantee on Cn,k (δ) that


!
P ∃ k ∈ N and f ∈ Fk such that sup L
b n (f ) − L(f ) ≥ Cn,k (δ)
f ∈Fk
∞ ∞
!
X X
≤ P ∃ f ∈ Fk such that sup L
b n (f ) − L(f ) ≥ Cn,k (δ) ≤ δ · 2−k = δ.
k=1 f ∈Fk k=1

96
Lexture Notes on Statistics and Information Theory John Duchi

On the event that supf ∈Fk |L b n (f ) − L(f )| < Cn,k (δ) for all k, which occurs with probability at least
1 − δ, we have
n o
L(fb) ≤ L
b n (f ) + C b (δ) = inf inf L
n,k
b n (f ) + Cn,k (δ) ≤ inf inf {L(f ) + 2Cn,k (δ)}
k∈N f ∈Fk k∈N f ∈Fk

by our choice of fb. This is the desired result.

We conclude with a final example, using our earlier floating point bound from Example 4.4.5,
coupled with Corollary 4.4.6 and Theorem 4.4.12.
Example 4.4.13 (Structural risk minimization with floating point classifiers): Consider
again our floating point example, and let the function class Fk consist of functions defined by
at most k double-precision floating point values, so that log |Fk | ≤ 45d. Then by taking
s
log 1δ + 65k log 2
Cn,k (δ) =
2n
we have that |Lb n (f )−L(f )| ≤ Cn,k (δ) simultaneously for all f ∈ Fk and all Fk , with probability
at least 1 − δ. Then the empirical risk minimization procedure (4.4.7) guarantees that
 s 
1
 2 log δ + 91k 
L(fb) ≤ inf inf L(f ) + .
k∈N f ∈Fk n 

Roughly, we trade between small risk L(f )—as the qrisk inf f ∈Fk L(f ) must be decreasing in
k—and the estimation error penalty, which scales as (k + log 1δ )/n. 3

4.5 Technical proofs


4.5.1 Proof of Theorem 4.1.11
(1) implies (2) Let K1 = 1. Using the change Rof variables identity that for a nonnegative

random variable Z and any k ≥ 1 we have E[Z k ] = k 0 tk−1 P(Z ≥ t)dt, we find
Z ∞ Z ∞  2 Z ∞
t
k
E[|X| ] = k k−1
t P(|X| ≥ t)dt ≤ 2k tk−1
exp − 2 dt = kσ k uk/2−1 e−u du,
0 0 σ 0

where for the last inequality we made the substitution u = t2 /σ 2 . Noting that this final integral is
Γ(k/2), we have E[|X|k ] ≤ kσ k Γ(k/2). Because Γ(s) ≤ ss for s ≥ 1, we obtain
p √
E[|X|k ]1/k ≤ k 1/k σ k/2 ≤ e1/e σ k.
Thus (2) holds with K2 = e1/e .

1 k
(2) implies (3) Let σ = kXkψ2 = supk≥1 k − 2 E[|X|k ]1/k , so that K2 = 1 and E[|X|k ] ≤ k 2 σ for
all k. For K3 ∈ R+ , we thus have
∞ ∞ ∞ 
E[X 2k ] σ 2k (2k)k (i) X 2e k
X X 
E[exp(X 2 /(K3 σ 2 ))] = ≤ ≤
k!K32k σ 2k k!K32k σ 2k K32
k=0 k=0 k=0

where inequality (i) follows because k! ≥ (k/e)k , or 1/k! ≤ (e/k)k . Noting that ∞ k 1
P
p k=0 α = 1−α ,
we obtain (3) by taking K3 = e 2/(e − 1) ≈ 2.933.

97
Lexture Notes on Statistics and Information Theory John Duchi

(3) implies (4) Let us take K3 = 1. We claim that (4) holds with K4 = 34 . We prove this
result for both small and large λ. First, note the (highly non-standard, but true!) inequality that
9x2
ex ≤ x + e 16 for all x. Then we have
9λ2 X 2
  
E[exp(λX)] ≤ E[λX] +E exp
| {z } 16
=0
4
Now note that for |λ| ≤ 3σ ,
we have 9λ2 σ 2 /16
≤ 1, and so by Jensen’s inequality,
  2 2   
9λ X 2
2 2
2 9λ16σ 9λ2 σ 2
E exp = E exp(X /σ ) ≤ e 16 .
16
λ2 cx2
For large λ, we use the simpler Fenchel-Young inequality, that is, that λx ≤ 2c + 2 , valid for all
c ≥ 0. Then we have for any 0 ≤ c ≤ 2 that
  2 
λ2 σ 2 cX λ2 σ 2 c
E[exp(λX)] ≤ e 2c E exp 2
≤ e 2c e 2 ,

4 1 9 2 2
where the final inequality follows from Jensen’s inequality. If |λ| ≥ 3σ , then 2 ≤ 32 λ σ , and we
have  2 2
1
[ 2c 9c 2 2
+ 32 ]λ σ 3λ σ
E[exp(λX)] ≤ inf e = exp .
c∈[0,2] 4

1
(4) implies (1) This is the content of Proposition 4.1.8, with K4 = 2 and K1 = 2.

4.5.2 Proof of Theorem 4.1.15


(1) implies (2) As Rin the proof of Theorem 4.1.11, we use that for a nonnegative random variable

Z we have E[Z k ] = k 0 tk−1 P(Z ≥ t)dt. Let K1 = 1. Then
Z ∞ Z ∞ Z ∞
k k−1 k−1 k
E[|X| ] = k t P(|X| ≥ t)dt ≤ 2k t exp(−t/σ)dt = 2kσ uk−1 exp(−u)du,
0 0 0

where we used the substitution u = t/σ. Thus we have E[|X|k ] ≤ 2Γ(k + 1)σ k , and using Γ(k + 1) ≤
k k yields E[|X|k ]1/k ≤ 21/k kσ, so that (2) holds with K2 ≤ 2.

(2) implies (3) Let K2 = 1, and note that


∞ ∞ ∞ 
E[X k ] k k 1 (i) X e k
X X 
E[exp(X/(K3 σ))] = k σ k k!
≤ · ≤ ,
K 3 k! K3k K3
k=0 k=0 k=0

where inequality (i) used that k! ≥ (k/e)k . Taking K3 = e2 /(e − 1) < 5 gives the result.

(3) implies (1) If E[exp(X/σ)] ≤ e, then for t ≥ 0


P(X ≥ t) ≤ E[exp(X/σ)]e−t/σ ≤ e1−t/σ .
With the same result for the negative tail, we have
2t
P(|X| ≥ t) ≤ 2e1−t/σ ∧ 1 ≤ 2e− 5σ ,
so that (1) holds with K1 = 52 .

98
Lexture Notes on Statistics and Information Theory John Duchi

(2) if and only if (4) Thus, we see that up to constant numerical factors, the definition kXkψ1 =
supk≥1 k −1 E[|X|k ]1/k has the equivalent statements

P(|X| ≥ t) ≤ 2 exp(−t/(K1 kXkψ1 )) and E[exp(X/(K3 kXkψ1 ))] ≤ e.

Now, let us assume that (2) holds with K2 = 1, so that σ = kXkψ1 and that E[X] = 0. Then we
have E[X k ] ≤ k k kXkkψ1 , and
∞ ∞ ∞
X λk E[X k ] X kk X
E[exp(λX)] = 1 + ≤1+ λk
kXkkψ1 · ≤1+ λk kXkkψ1 ek ,
k! k!
k=2 k=2 k=2

1
the final inequality following because k! ≥ (k/e)k . Now, if |λ| ≤ 2ekXkψ , then we have
1


X
E[exp(λX)] ≤ 1 + λ2 e2 kXkψ1 (λ kXkψ1 e)k ≤ 1 + 2e2 kXk2ψ1 λ2 ,
k=0
P∞ −k = 2. Using 1 + x ≤ ex gives that (2) implies (4). For
as the final sum is at most k=0 2
the opposite direction, we may simply use that if (4) holds with K4 = 1 and K40 = 1, then
E[exp(X/σ)] ≤ exp(1), so that (3) holds.

4.5.3 Proof of Theorem 4.3.6


JCD Comment: I would like to write this. For now, check out Ledoux and Talagrand
[129, Theorem 4.12] or Koltchinskii [122, Theorem 2.2].

4.6 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin’s extraordi-
narily readable lecture notes [170], upon which our proof of Theorem 4.1.11 is based, the compre-
hensive book of Boucheron, Lugosi, and Massart [34], and the more advanced material in Buldygin
and Kozachenko [41]. Many of our arguments are based off of those of Vershynin and Boucheron
et al. Kolmogorov and Tikhomirov [121] introduced metric entropy.

4.7 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let X be a random variable taking
values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding’s Lemma, that is,
that X is sub-Gaussian: for all λ ∈ R, we have
 2
λ (b − a)2

E[exp(λ(X − E[X]))] ≤ exp .
8
(b−a)2
(a) Show that Var(X) ≤ ( b−a 2
2 ) = 4 for any random variable X taking values in [a, b].

(b) Let
ϕ(λ) = log E[exp(λ(X − E[X]))].

99
Lexture Notes on Statistics and Information Theory John Duchi

Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that

E[X 2 etX ] E[XetX ]2


ϕ(0) = 0, ϕ0 (0) = 0, ϕ00 (t) = − .
E[etX ] E[etX ]2

(You may assume that derivatives and expectations commute, which they do in this case.)

(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and

Var(Yt ) = ϕ00 (t).

(You may assume X has a density for simplicity.)


λ2 (b−a)2
(d) Using the result of part (c), show that ϕ(λ) ≤ 8 for all λ ∈ R.

Exercise 4.2: In this question, we show how to use Bernstein-type (sub-exponential) inequal-
ities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequal-
ity (4.1.6)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and
Var(Xi ) ≤ σ 2 , then
n n
( ! !)
5 nt2 nt
  
1X 1X 1
max P Xi ≥ E[X] + t , P Xi ≤ E[X] − t ≤ exp − min , .
n n 2 6 σ 2 2b
i=1 i=1

We consider minimization of loss functions ` over finite function classes F with ` ∈ [0, 1], so that if
L(f ) = E[`(f, Z)] then |`(f, Z) − L(f )| ≤ 1. Throughout this question, we let

L? = min L(f ) and f ? ∈ argmin L(f ).


f ∈F f ∈F

We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.

(a) Argue that for any f ∈ F


2
  
  
b ) ≥ L(f ) + t ∨ P L(f
 1
b ) ≤ L(f ) − t ≤ exp − min 5 nt nt
P L(f , .
2 6 L(f )(1 − L(f )) 2

(b) Define the set of “bad” prediction functions F bad := {f ∈ F : L(f ) ≥ L? + }. Show that for
any fixed  ≥ 0 and any f ∈ F2 bad , we have

n2
  

?
 1 5 n
P L(f ) ≤ L +  ≤ exp − min
b , .
2 6 L? (1 − L? ) + (1 − ) 2

(c) Let fbn ∈ argminf ∈F L(f


b ) denote the empirical minimizer over the class F. Argue that it is
likely to have good performance, that is, for all  ≥ 0 we have

n2
  
  1 5 n
P L(fbn ) ≥ L(f ? ) + 2 ≤ card(F) · exp − min , .
2 6 L? (1 − L? ) + (1 − ) 2

100
Lexture Notes on Statistics and Information Theory John Duchi

(d) Using the result of part (c), argue that with probability at least 1 − δ,
q
|F | L? (1 − L? ) · log |Fδ |
r
? 4 log δ 12
L(fn ) ≤ L(f ) +
b + · √ .
n 5 n

Why is this better than an inequality based purely on the boundedness of the loss `, such as
Theorem 4.4.4 or Corollary 4.4.6? What happens when there is a perfect risk minimizer f ? ?

Exercise 4.3 (Likelihood ratio bounds and concentration): Consider a data release problem,
where given a sample x, we release a sequence of data Z1 , Z2 , . . . , Zn belonging to a discrete set Z,
where Zi may depend on Z1i−1 and x. We assume that the data has limited information about x
in the sense that for any two samples x, x0 , we have the likelihood ratio bound

p(zi | x, z1i−1 )
≤ eε .
p(zi | x0 , z1i−1 )

Let us control the amount of “information” (in the form of an updated log-likelihood ratio) released
by this sequential mechanism. Fix x, x0 , and define

p(z1 , . . . , zn | x)
L(z1 , . . . , zn ) := log .
p(z1 , . . . , zn | x0 )

(a) Show that, assuming the data Zi are drawn conditional on x,

t2
 
ε
P (L(Z1 , . . . , Zn ) ≥ nε(e − 1) + t) ≤ exp − .
2nε2

Equivalently, show that


 p 
P L(Z1 , . . . , Zn ) ≥ nε(eε − 1) + ε 2n log(1/δ) ≤ δ.

(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any
test Ψ : Z n → {x, x0 }, we have

Px (Ψ(Z1n ) 6= x) + Px0 (Ψ(Z1n ) 6= x0 ) ≥ 1 − γ,

where Px and Px0 denote the sampling distribution of Z1n under x and x0 , respectively?

Exercise 4.4 (Marcinkiewicz-Zygmund inequality): Let Xi be independent random variables


with E[Xi ] = 0 and E[|Xi |p ] < ∞, where 1 ≤ p < ∞. Prove that
" n # " n p/2 #
X p X
E Xi ≤ Cp E |Xi |2
i=1 i=1

where Cp is a constant (that depends on p). As a corollary, derive that if E[|Xi |p ] ≤ σ p and p ≥ 2,
then
n
" #
p
1X σp
E Xi ≤ Cp p/2 .
n n
i=1

101
Lexture Notes on Statistics and Information Theory John Duchi

That is, sample means converge quickly to zero in higher moments. Hint: For any fixed x ∈ Rn , if
εi are i.i.d. uniform signs εi ∈ {±1}, then εT x is sub-Gaussian.
Exercise 4.5 (Small balls and anti-concentration): Let X be a nonnegative random variable
satisfying P(X ≤ ) ≤ c for some c < ∞ and all  > 0. Argue that if Xi are i.i.d. copies of X, then
n
!
1X
P Xi ≥ t ≥ 1 − exp(−2n [1/2 − 2ct]2+ )
n
i=1

for all t.
Exercise 4.6 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Exercise 4.7 (Sub-gaussian maxima): Let X1 , . . . , Xn be σ 2 -sub-gaussian (not necessarily inde-
pendent) random variables. Show that
p
(a) E[maxi Xi ] ≤ 2σ 2 log n.

(b) There exists a numerical constant C < ∞ such that E[maxi |Xi |p ] ≤ (Cpσ 2 log k)p/2 .

Exercise 4.8: Consider a binary classification problem with logistic loss `(θ; (x, y)) = log(1 +
exp(−yθT x)), where θ ∈ Θ := {θ ∈ Rd | kθk1 ≤ r} and y ∈ {±1}. Assume additionally that the
space X ⊂ {x ∈ Rd | kxk∞ ≤ b}. Define the empirical and population risks L b n (θ) := Pn `(θ; (X, Y ))
and L(θ) := P `(θ; (X, Y )), and let θbn = argminθ∈Θ L(θ).
b Show that with probability at least 1 − δ
iid
over (Xi , Yi ) ∼ P , q
rb log dδ
L(θbn ) ≤ inf L(θ) + C √
θ∈Θ n
where C < ∞ is a numerical constant (you need not specify this).
Exercise 4.9 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will
derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [106, Thm. 1] or [118, 24]),
showing
1 − 2p 2
log E[et(X−p) ] ≤ t for all t ≥ 0. (4.7.1)
4 log 1−p
p

(a) Define ϕ(t) = log(E[et(X−p) ]) = log((1 − p)e−tp + pet(1−p) ). Show that

ϕ0 (t) = E[Yt ] and ϕ00 (t) = Var(Yt )

pet(1−p)
where Yt = (1 − p) with probability q(t) := pet(1−p) +(1−p)e−tp
and Yt = −p otherwise.

(b) Show that ϕ0 (0) = 0 and that if p > 1


2, then Var(Yt ) ≤ Var(Y0 ) = p(1 − p). Conclude that
ϕ(t) ≤ p(1−p)
2 t2 for all t ≥ 0.
1−2p 1+δ
(c) Argue that p(1 − p) ≤ for p ∈ [0, 1]. Hint: Let p = for δ ∈ [0, 1], so that the
2 log 1−p
p
2
1+δ 2δ
Rδ 1
inequality is equivalent to log 1−δ ≤ 1−δ 2 . Then use that log(1 + δ) = 0 1+u du.

102
Lexture Notes on Statistics and Information Theory John Duchi

(d) Let C = 2 log 1−p 1−p


p and define s = Ct = 2 log p s, and let

1 − 2p 2
f (s) = Cs + Cps − log(1 − p + peCs ),
2
so that inequality (4.7.1) holds if and only if f (s) ≥ 0 for all s ≥ 0. Give f 0 (s) and f 00 (s).

(e) Show that f (0) = f (1) = f 0 (0) = f 0 (1) = 0, and argue that f 00 (s) changes signs at most twice
and that f 00 (0) = f 00 (1) > 0. Use this to show that f (s) ≥ 0 for all s ≥ 0.

JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
1−2p
Exercise 4.10: Let s(p) = . Show that s is concave on [0, 1].
log 1−p
p

Exercise 4.11: Prove Lemma 4.3.8.


JCD Comment: Add in some connections to the exponential family material. Some
ideas:

1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)

2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?

3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.

4. Maybe an exercise for Lipschitz functions with random Lipschitz constants?

103
Chapter 5

Generalization and stability

Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
X and some collection of functions f : X → R. We then wish to understand when we expect the
empirical distribution
n
1X
Pn := 1Xi ,
n
i=1
iid
defined by teh sample Xi ∼ P , to be close to the population P as measured by f . Following the
notation we introduce in Section 4.3, for P f := EP [f (X)], we again ask to have
n
1X 
Pn f − P f = f (Xi ) − EP [f (X)]
n
i=1

to be small simultaneously for all f .


In this chapter, however, we develop a family of tools based around PAC (probably approximately
correct) Bayesian bounds, where we slightly perturb the functions f of interest to average them in
some way; when these perturbations keep Pn f stable, we expect that Pn f ≈ P f , that is, the sample
generalizes to the population. These perturbations allow us to bring the tools of the divergence
measures we have developed to bear on the problems of convergence and generalization. Even more,
they allow us to go beyond the “basic” concentration inequalities to situations with interaction,
where a data analyst may evaluate some functions of Pn , then adaptively choose additional queries
or analyses to do on the sample sample X1n . This breaks standard statistical analyses—which
assume an a priori specified set of hypotheses or questions to be answered—but is possible to
address once we can limit the information the analyses release in precise ways that information-
theoretic tools allow. Modern work has also shown how to leverage these techniques, coupled with
computation, to provide non-vacuous bounds on learning for complicated scenarios and models to
which all classical bounds fail to apply, such as deep learning.

104
Lexture Notes on Statistics and Information Theory John Duchi

5.1 The variational representation of Kullback-Leibler divergence


The starting point of all of our generalization bounds is a surprisingly simply variational result,
which relates expectations, moment generating functions, and the KL-divergence in one single
equality. It turns out that this inequality, by relating means with moment generating functions
and divergences, allows us to prove generalization bounds based on information-theoretic tools and
stability.
Theorem 5.1.1 (Donsker-Varadhan variational representation). Let P and Q be distributions on
a common space X . Then
n o
Dkl (P ||Q) = sup EP [g(X)] − log EQ [eg(X) ] ,
g

where the supremum is taken over measurable functions g : X → R with EQ [eg(X) ] < ∞.
We give one proof of this result and one sketch of a proof, which holds when the underlying space
is discrete, that may be more intuitive: the first constructs a particular “tilting” of Q via the
function eg , and verifies the equality. The second relies on the discretization of the KL-divergence
and may be more intuitive to readers familiar with convex optimization: essentially, we expect this
result because the function log( kj=1 exj ) is the convex conjugate of the negative entropy. (See also
P
Exercise 5.1.)
Proof We may assume that P is absolutely continuous with respect to Q, meaning that Q(A) = 0
implies that P (A) = 0, as otherwise both sides are infinite by inspection. Thus, it is no loss of
generality to let P and Q have densities p and q.
Attainment in the equality is easy: we simply take g(x) = log p(x) q(x) , so that EQ [e
g(X) ] = 1. To

show that the right hand side is never larger than Dkl (P ||Q) requires a bit more work. To that
end, let g be any function such that EQ [eg(X) ] < ∞, and define the random variable Zg (x) =
eg(x) /EQ [eg(X) ], so that EQ [Z] = 1. Then using the absolute continuity of P w.r.t. Q, we have
     
p(X) q(X) dQ
EP [log Zg ] = EP log + log Zg (X) = Dkl (P ||Q) + EP log Zg
q(X) p(X) dP
 
dQ
≤ Dkl (P ||Q) + log EP Zg
dP
= Dkl (P ||Q) + log EQ [Zg ].

As EQ [Zg ] = 1, using that EP [log Zg ] = EP [g(X)] − log EQ [eg(X) ] gives the result.

Here is the second proof of Theorem 5.1.1, which applies when X is discrete and finite. That we
can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests
that this approach works in general—which it can—but this requires some not completely trivial
approximations of EP [g] and EQ [eg ] by discretized versions of their expectations, which makes
things rather tedious.
Proof of Theorem 5.1.1, the finite case As we have assumed that P and Q have finite
supports, which we identify with {1, . . . , k} and p.m.f.s p, q ∈ ∆k = {p ∈ Rk+ | h1, pi = 1}. Define
fq (v) = log( kj=1 qj evj ), which is convex in v (recall Proposition 3.2.1). Then the supremum in
P
the variational representation takes the form
h(p) := sup {hp, vi − fq (v)} .
v∈Rk

105
Lexture Notes on Statistics and Information Theory John Duchi

If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that
end, note that
" #k
qi evi
∇v {hp, vi − fq (v)} = p − Pk ,
q evj
j=1 j i=1
p
so that setting vj = log qjj achieves p − ∇v fq (v) = p − p = 0 and hence the supremum. Noting that
p
log( kj=1 qj exp(log qjj )) = log( kj=1 pj ) = 0 gives h(p) = Dkl (p||q).
P P

The Donsker-Varadhan variational representation already gives a hint that we can use some
information-theoretic techniques to control the difference between an empirical sample and its
expectation, at least in an average sense. In particular, we see that for any function g, we have

EP [g(X)] ≤ Dkl (P ||Q) + log EQ [eg(X) ]

for any random variable X. Now, changing this on its head a bit, suppose that we consider a
collection of functions F and put two probability measures π and π0 on F, and consider Pn f − P f ,
where we consider f a random variable f ∼ π or f ∼ π0 . Then a consequence of the Donsker-
Varadhan theorem is that
Z Z
(Pn f − P f )dπ(f ) ≤ Dkl (π||π0 ) + log exp(Pn f − P f )dπ0 (f )

for any π, π0 . While this inequality is a bit naive—bounding a difference by an exponent seems
wasteful—as we shall see, it has substantial applications when we can upper bound the KL-
divergence Dkl (π||π0 ).

5.2 PAC-Bayes bounds


Probably-approximately-correct (PAC) Bayesian bounds proceed from a perspective similar to that
of the covering numbers and covering entropies we develop in Section 4.3, where if for a collection
of functions F there is a finite subset (a cover) {fv } such that each f ∈ F is “near” one of the
fv , then we need only control deviations of Pn f from P f for the elements of {fv }. In PAC-Bayes
bounds, we instead average functions f with other functions, and this averaging allows a similar
family of guarantees and applications.
Let us proceed with the main results. Let F be a collection of functions f : X → R, and
assume that each function f is σ 2 -sub-Gaussian, which we recall (Definition 4.1) means that
λ(f (X)−P f ) 2 2
R
E[e ] ≤ exp(λ σ /2) for all λ ∈ R, where P f = EP [f (X)] = f (x)dP (x) denotes the
expectation of f under P . The main theorem of this section shows that averages of the squared
error (Pn f − P f )2 of the empirical distribution Pn to P converge quickly to zero for all averaging
distributions π on functions f ∈ F so long as each f is σ 2 -sub-Gaussian, with the caveat that we
pay a cost for different choices of π. The key is that we choose some prior distribution π0 on F
first.
Theorem 5.2.1. Let Π be the collection of all probability distributions on the set F and let π0 be
a fixed prior probability distribution on f ∈ F. With probability at least 1 − δ,

8σ 2 Dkl (π||π0 ) + log 2δ


Z
(Pn f − P f )2 dπ(f ) ≤ simultaneously for all π ∈ Π.
3 n

106
Lexture Notes on Statistics and Information Theory John Duchi

Proof The key is to combine Example 4.1.12 with the variational representation that Theo-
rem 5.1.1 provides for KL-divergences. We state Example 4.1.12 as a lemma here.

Lemma 5.2.2. Let Z be a σ 2 -sub-Gaussian random variable. Then for λ ≥ 0,


2 1
E[eλZ ] ≤ q .
[1 − 2σ 2 λ]+

PWithout loss of generality, we assume that P f = 0 for all f ∈ F, and recall that Pn f =
1 n 2
n i=1 f (Xi ) is the empirical mean of f . Then we know that Pn f is σ /n-sub-Gaussian, and
−1/2
Lemma 5.2.2 implies that E[exp(λ(Pn f )2 )] ≤ 1 − 2λσ 2 /n +
 
for any f , and thus for any prior
π0 on f we have Z 
−1/2
exp(λ(Pn f ) )dπ0 (f ) ≤ 1 − 2λσ 2 /n + .
2

E

3n
Consequently, taking λ = λn := 8σ 2 , we obtain

Z  Z   
2 3n 2
E exp(λn (Pn f ) )dπ0 (f ) = E exp (Pn f ) dπ0 (f ) ≤ 2.
8σ 2

Markov’s inequality thus implies that


Z 
2
 2
P exp λn (Pn f ) dπ0 (f ) ≥ ≤ δ, (5.2.1)
δ
iid
where the probability is over Xi ∼ P .
Now, we use the Donsker-Varadhan equality (Theorem 5.1.1). Letting λ > 0, we define the
function g(f ) = λ(Pn f )2 , so that for any two distributions π and π0 on F, we have

Dkl (π||π0 ) + log exp(λ(Pn f )2 )dπ0 (f )


Z Z R
1 2
g(f )dπ(f ) = (Pn f ) dπ(f ) ≤ .
λ λ
This holds without any probabilistic qualifications, so using the application (5.2.1) of Markov’s
inequality with λ = λn , we thus see that with probability at least 1 − δ over X1 , . . . , Xn , simulta-
neously for all distributions π,

8σ 2 Dkl (π||π0 ) + log 2δ


Z
(Pn f )2 dπ(f ) ≤ .
3 n
This is the desired result (as we have assumed that P f = 0 w.l.o.g.).

By Jensen’s inequality (or Cauchy-Schwarz), it is immediate from Theorem 5.2.1 that we also
have
s
8σ 2 Dkl (π||π0 ) + log 2δ
Z
|Pn f − P f |dπ(f ) ≤ simultaneously for all π ∈ Π (5.2.2)
3 n

with probability at least 1 − δ, so that Eπ [|Pn f − P f |] is with high probability of order 1/ n. The
inequality (5.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly

107
Lexture Notes on Statistics and Information Theory John Duchi

sharper constants and improved logarithmic dependence. The key is that stability, in the form of a
prior π0 and posterior π closeness, allow us to achieve reasonably tight control over the deviations
of random variables and functions with high probability.
Let us give an example, which is similar to many of our approaches in Section 4.4, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.

Example 5.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 4.4,
where we let L(θ) = P `(θ, Z) for some function ` : Θ × Z → R. Let Bd2 = {v ∈ Rd | kvk2 ≤ 1}
be the `2 -ball in Rd , and let us assume that Θ ⊂ rBd2 and additionally that θ 7→ `(θ, z) is
M -Lipschitz for all z ∈ Z. For simplicity, we assume that `(θ, z) ∈ [0, 2M r] for all θ ∈ Θ (we
may simply relativize our bounds by replacing ` by `(·, z) − inf θ∈Θ `(θ, z) ∈ [0, 2M r]).
If L
b n (θ) = Pn `(θ, Z), then Theorem 5.2.1 implies that
s
2 r2
Z  
8M 2
|L
b n (θ) − L(θ)|dπ(θ) ≤ Dkl (π||π0 ) + log
3n δ

for all π with probability at least 1 − δ. Now, let θ0 ∈ Θ be arbitrary, and for  > 0 (to be
chosen later) take π0 to be uniform on (r + )Bd2 and π to be uniform on θ0 + Bd2 . Then we
immediately see that Dkl (π||π0 ) = d log(1+ r ). Moreover, we have L
R
b n (θ)dπ(θ) ∈ L
b n (θ0 )±M 
and similarly for L(θ), by the M -Lipschitz continuity of `. For any fixed  > 0, we thus have
s
2M 2 r2
 
 r 2
|Ln (θ0 ) − L(θ0 )| ≤ 2M  +
b d log 1 + + log
3n  δ

rd
simultaneously for all θ0 ∈ Θ, with probability at least 1 − δ. By choosing  = n we obtain
that with probability at least 1 − δ,
s
8M 2 r2
 
2M rd  n 2
sup |Ln (θ) − L(θ)| ≤
b + d log 1 + + log .
θ∈Θ n 3n d δ
q
Thus, roughly, with high probability we have |L
b n (θ) − L(θ)| ≤ O(1)M r d
n log nd for all θ. 3

On the one hand, the result in Example 5.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for
specially structured linear function classes, then applying Rademacher complexity bounds—such
as Proposition 4.4.9 and Example 4.4.10—we have somewhat weaker results, in that they depend
on the dimension explicitly, while the Rademacher bounds do not exhibit this explicit dependence.
This means they can potentially apply in infinite dimensional spaces that Example 5.2.3 cannot.
We will give an example presently showing how to address some of these issues.

5.2.1 Relative bounds


In many cases, it is useful to have bounds that provide somewhat finer control than the bounds
we have presented. Recall from our discussion of sub-Gaussian and sub-exponential random vari-
ables, especially the Bennett and Bernstein-type inequalities (Proposition 4.1.20), that if a random

108
Lexture Notes on Statistics and Information Theory John Duchi

variable X satisfies |X| ≤ b but Var(X) ≤ σ 2  b2 , then X concentrates more quickly about
its mean than the convergence provided by naive application of sub-Gaussian concentration with
sub-Gaussian parameter b2 /8. To that end, we investigate an alternative to Theorem 5.2.1 that
allows somewhat sharper control.
The approach is similar to our derivation in Theorem 5.2.1, where we show that the moment
generating function of a quantity like Pn f − P f is small (Eq. (5.2.1)) and then relate this—via the
Donsker-Varadhan change of measure in Theorem 5.1.1—to the quantities we wish to control. In
the next proposition, we provide relative bounds on the deviations of functions from their means.
To make this precise, let F be a collection of functions f : X → R, and let σ 2 (f ) := Var(f (X)) be
the variance of functions in F. We assume the class satisfies the Bernstein condition (4.1.7) with
parameter b, that is,
h i k!
E (f (X) − P f )k ≤ σ 2 (f )bk−2 for k = 3, 4, . . . . (5.2.3)
2
This says that the second moment of functions f ∈ F bounds—with the additional boundedness-
type constant b—the higher moments of functions in f . We then have the following result.

Proposition 5.2.4. Let F be a collection of functions f : X → R satisfying the Bernstein condi-


1
tion (5.2.3). Then for any |λ| ≤ 2b , with probability at least 1 − δ,
Z Z Z  
1 1
λ P f dπ(f ) − λ2 σ 2 (f )dπ(f ) ≤ λ Pn f dπ(f ) + Dkl (π||π0 ) + log
n δ
simultaneously for all π ∈ Π.

Proof We begin with an inequality on the moment generating function of random variables
satisfying the Bernstein condition (4.1.7), that is, that |E[(X − µ)k ]| ≤ k! 2 k−2 for k ≥ 2. In this
2σ b
case, Lemma 4.1.19 implies that
E[eλ(X−µ) ] ≤ exp(λ2 σ 2 )
for |λ| ≤ 1/(2b). As a consequence, for any f in our collection F, we see that if we define

∆n (f, λ) := λ Pn f − P f − λσ 2 (f ) ,
 

we have that
E[exp(n∆n (f, λ))] = E[exp(λ(f (X) − P f ) − λ2 σ 2 (f ))]n ≤ 1
1
for all n, f ∈ F, and |λ| ≤ 2b .Then, for any fixed measure π0 on F, Markov’s inequality implies
that Z 
1
P exp(n∆n (f, λ))dπ0 (f ) ≥ ≤ δ. (5.2.4)
δ
Now, as in the proof of Theorem 5.2.1, we use the Donsker-Varadhan Theorem 5.1.1 (change of
measure), which implies that
Z Z
n ∆n (f, λ)dπ(f ) ≤ Dkl (π||π0 ) + log exp(n∆n (f, λ))dπ0 (f )

for all distributions π. Using inequality (5.2.4), we obtain that with probability at least 1 − δ,
Z  
1 1
∆n (f, λ)dπ(f ) ≤ Dkl (π||π0 ) + log
n δ

109
Lexture Notes on Statistics and Information Theory John Duchi

for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.

We would like to optimize over the bound in Proposition 5.2.4 by choosing the “best” λ. If we
could choose the optimal λ, by rearranging Proposition 5.2.4 we would obtain the bound
 
2 1 h 1i
Eπ [P f ] ≤ Eπ [Pn f ] + inf λEπ [σ (f )] + Dkl (π||π0 ) + log
λ>0 nλ δ
r
Eπ [σ 2 (f )] h 1i
= Eπ [Pn f ] + 2 Dkl (π||π0 ) + log
n δ
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 5.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
To that end, let us consider the following condition, a type of relative error condition related
to the Bernstein condition (4.1.7): for each f ∈ F,

σ 2 (f ) ≤ bP f. (5.2.5)

This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = `(θ, X) for some loss function ` and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (5.2.5) holds. Revisiting Proposition 5.2.4, we rearrange to obtain the following
theorem.

Theorem 5.2.5. Let F be a collection of functions satisfying the Bernstein condition (5.2.3) as in
Proposition 5.2.4, and in addition, assume the variance-bounding condition (5.2.5). Then for any
1
0 ≤ λ ≤ 2b , with probability at least 1 − δ,

λb 1 1h 1i
Eπ [P f ] ≤ Eπ [Pn f ] + Eπ [Pn f ] + Dkl (π||π0 ) + log
1 − λb λ(1 − λb) n δ

for all π.

Proof We use condition (5.2.5) to see that

λEπ [P f ] − λ2 bEπ [P f ] ≤ λEπ [P f ] − λ2 Eπ [σ 2 (f )],

apply Proposition 5.2.4, and divide both sides of the resulting inequality by λ(1 − λb).

To make this uniform in λ, thus achieving a tighter bound (so that we need not pre-select λ),
1 λb
we choose multiple values of λ and apply a union bound. To that end, let 1 + η = 1−λb , or η = 1−λb
1 (1+η)2
and λb(1−λb) = η , so that the inequality in Theorem 5.2.1 is equivalent to

(1 + η)2 b h 1i
Eπ [P f ] ≤ Eπ [Pn f ] + ηEπ [Pn f ] + Dkl (π||π0 ) + log .
η n δ

110
Lexture Notes on Statistics and Information Theory John Duchi

Using that our choice of η ∈ [0, 1], this implies


1 bh 1 i 3b h 1i
Eπ [P f ] ≤ Eπ [Pn f ] + ηEπ [Pn f ] + Dkl (π||π0 ) + log + Dkl (π||π0 ) + log .
ηn δ n δ
Now, take η1 = 1/n, . . . , ηn = 1. Then by optimizing over η ∈ {η1 , . . . , ηn } (which is equivalent, to
within a 1/n factor, to optimizing over 0 < η ≤ 1) and applying a union bound, we obtain

Corollary 5.2.6. Let the conditions of Theorem 5.2.5 hold. Then with probability at least 1 − δ,
r
bEπ [Pn f ] h ni 1  h n i
Eπ [P f ] ≤ Eπ [Pn f ] + 2 Dkl (π||π0 ) + log + Eπ [Pn f ] + 5b Dkl (π||π0 ) + log ,
n δ n δ
simultaneously for all π on F.

Proof By a union bound, we have


1 bh n i 3b h ni
Eπ [P f ] ≤ Eπ [Pn f ] + ηEπ [Pn f ] + Dkl (π||π0 ) + log + Dkl (π||π0 ) + log
ηn δ n δ

for each η ∈ {1/n, . . . , 1}. We consider two cases. In the first, assume that Eπ [Pn f ] ≤ nb (Dkl (π||π0 )+
log nδ . Then taking η = 1 above evidently gives the result. In the second, we have Eπ [Pn f ] >
b n
n (Dkl (π||π0 ) + log δ ), and we can set
s
b
(Dkl (π||π0 ) + log nδ )
η? = n ∈ (0, 1).
Eπ [Pn f ]
1
Choosing η to be the smallest value ηk in {η1 , . . . , ηn } with ηk ≥ η? , so that η? ≤ η ≤ η? + n then
implies the claim in the corollary.

5.2.2 A large-margin guarantee


Let us revisit the loss minimization approaches central to Section 4.4 and Example 5.2.3 in the
context of Corollary 5.2.6. We will investigate an approach to achieve convergence guarantees that
are (nearly) independent of dimension, focusing on 0-1 losses in a binary classification problem.
Consider a binary classification problem with data (x, y) ∈ Rd × {±1}, where we make predictions
hθ, xi (or its sign), and for a margin penalty γ ≥ 0 we define the loss

`γ (θ; (x, y)) = 1 {hθ, xiy ≤ γ} .

We call the quantity hθ, xiy the margin of θ on the pair (x, y), noting that when the margin is
large, hθ, xi has the same sign as y and is “confident” (i.e. far from zero). For shorthand, let us
define the expected and empirical losses at margin γ by

Lγ (θ) := P `γ (θ; (X, Y )) and L


b γ (θ) := Pn `γ (θ; (X, Y )).

Consider the following scenario: the data x lie in a ball of radius b, so that kxk2 ≤ b; note that
the losses `γ and `0 satisfy the Bernstein (5.2.3) and self-bounding (5.2.5) conditions with constant
1 as they take values in {0, 1}. We then have the following proposition.

111
Lexture Notes on Statistics and Information Theory John Duchi

Proposition 5.2.7. Let the above conditions on the data (x, y) hold and let the margin γ > 0 and
radius r < ∞. Then with probability at least 1 − δ,
√ rb log n p r2 b2 log nδ
 
1
P (hθ, XiY ≤ 0) ≤ 1 + Pn (hθ, XiY ≤ γ) + 8 √ δ Pn (hθ, XiY ≤ γ) + C
n γ n γ2n

simultaneously for all kθk2 ≤ r, where C is a numerical constant independent of the problem
parameters.

Proposition 5.2.7 provides a “dimension-free” guarantee—it depends only on the `2 -norms kθk2
and kxk2 —so that it can apply equally in infinite dimensional spaces. The key to the inequality
is that if we can find a large margin predictor—for example, one achieved by a support vector
machine or, more broadly, by minimizing a convex loss of the form
n
1X
minimize φ(hXi , θiYi )
kθk2 ≤r n
i=1

for some decreasing convex φ : R → R+ , e.g. φ(t) = [1 − t]+ or φ(t) = log(1 + e−t )—then we get
strong generalization performance guarantees relative to the empirical margin γ. As one particular
instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a
vector θ with kθk2 ≤ r such that hθ, Xi iYi ≥ γ for each i = 1, . . . , n. Then Proposition 5.2.7
guarantees that
r2 b2 log nδ
P (hθ, XiY ≤ 0) ≤ C
γ2n
with probability at least 1 − δ.
Proof Let π0 be N(0, τ 2 I) for some τ > 0 to be chosen, and let π be N(θ,
b τ 2 I) for some θb ∈ Rd
satisfying kθk
b 2 ≤ r. Then Corollary 5.2.6 implies that

Eπ [Lγ (θ)]
s
Eπ [L
b γ (θ)] h ni 1  b γ (θ)] + C Dkl (π||π0 ) + log n
h i
≤ Eπ [L
b γ (θ)] + 2 Dkl (π||π0 ) + log + Eπ [L
n δ n δ
s
h 2  h 2 i
b γ (θ)] + 2 Eπ [Lγ (θ)] r + log n + 1 Eπ [L b γ (θ)] + C r + log n
b i
≤ Eπ [L
n 2τ 2 δ n 2τ 2 δ

simultaneously for all θb satisfying kθk


b 2 ≤ r with probability at least 1 − δ, where we have used that
2
Dkl N(θ, τ I)||N(0, τ I) = kθk2 /(2τ 2 ).
2 2

Let us use the margin assumption. Note that if Z ∼ N(0, τ 2 I), then for any fixed θ0 , x, y we
have

`0 (θ0 ; (x, y)) − P(Z > x ≥ γ) ≤ E[`γ (θ0 + Z; (x, y))] ≤ `2γ (θ0 ; (x, y)) + P(Z > x ≥ γ)

where the middle expectation is over Z ∼ N(0, τ 2 I). Using the τ 2 kxk22 -sub-Gaussianity of Z > x, we
can obtain immediately that if kxk2 ≤ b, we have

γ2 γ2
   
`0 (θ0 ; (x, y)) − exp − 2 2 ≤ E[`γ (θ0 + Z; (x, y))] ≤ `2γ (θ0 ; (x, y)) + exp − 2 2 .
2τ b 2τ b

112
Lexture Notes on Statistics and Information Theory John Duchi

Returning to our earlier bound, we evidently have that if kxk2 ≤ b for all x ∈ X , then with
probability at least 1 − δ, simultaneously for all θ ∈ Rd with kθk2 ≤ r,
s

γ 2
 b 2γ (θ) + exp(− γ22 2 ) h r2
L ni
2τ b
L0 (θ) ≤ L2γ (θ) + 2 exp − 2 2 + 2
b + log
2τ b n 2τ 2 δ
2 2
   
1 b γ h r n i
+ L2γ (θ) + exp − 2 2 + C 2
+ log .
n 2τ b 2τ δ
2
Setting τ 2 = 2b2γlog n , we immediately see that for any choice of margin γ > 0, we have with
probability at least 1 − δ that
s
2b 1 hb b ih r2 b2 log n ni
L0 (θ) ≤ Lb 2γ (θ) + +2 L2γ (θ) + + log
n n n 2γ 2 δ
r2 b2 log n
 
1 b 1 h n i
+ L2γ (θ) + + C + log
n n 2γ 2 δ

for all kθk2 ≤ r.


Rewriting (replacing 2γ with γ) and recognizing that with no loss of generality we may take γ
such that rb ≥ γ gives the claim of the proposition.

5.2.3 A mutual information bound


An alternative perspective of the PAC-Bayesian bounds that Theorem 5.2.1 gives is to develop
bounds based on mutual information, which is also central to the interactive data analysis set-
ting in the next section. We present a few results along these lines here. Assume the setting of
Theorem 5.2.1, so that F consists of σ 2 -sub-Gaussian functions. Let us assume the following ob-
iid
servational model: we observe X1n ∼ P , and then conditional on the sample X1n , draw a (random)
function F ∈ F following the distribution π(· | X1n ). Assuming the prior π0 is fixed, Theorem 5.2.1
guarantees that with probability at least 1 − δ over X1n ,

8σ 2
 
2 n n 2
E[(Pn F − P F ) | X1 ] ≤ Dkl (π(· | X1 )||π0 ) + log ,
3n δ

where the expectation is taken over F ∼ π(· | X1n ), leaving the sample fixed. Now, consider choosing
π0 to be the average over all samples X1n of π, that is, π0 (·) = EP [π(· | X1n )], the expectation taken
iid
over X1n ∼ P . Then by definition of mutual information,

I(F ; X1n ) = EP [Dkl (π(· | X1n )||π0 )] ,

and by Markov’s inequality we have


1
P(Dkl (π(· | X1n )||π0 ) ≥ K · I(F ; X1n )) ≤
K
for all K ≥ 0. Combining these, we obtain the following corollary.

113
Lexture Notes on Statistics and Information Theory John Duchi

Corollary 5.2.8. Let F be chosen according to any distribution π(· | X1n ) conditional on the sample
iid
X1n . Then with probability at least 1 − δ0 − δ1 over the sample X1n ∼ P ,

8σ 2 I(F ; X1n )
 
2 n 2
E[(Pn F − P F ) | X1 ] ≤ + log .
3n δ0 δ1

This corollary shows that if we have any procedure—say, a learning procedure or otherwise—
that limits the information between a sample X1n and an output F , then we are guaranteed that
F generalizes. Tighter analyses of this are possible, though not our focus here, just that already
there should be an inkling that limiting information between input samples and outputs may be
fruitful.

5.3 Interactive data analysis


A major challenge in modern data analysis is that analyses are often not the classical statistics and
scientific method setting. In the scientific method—forgive me for being a pedant—one proposes
a hypothesis, the status quo or some other belief, and then designs an experiment to falsify that
hypothesis. Then, upon performing the experiment, there are only two options: either the experi-
mental results contradict the hypothesis (that is, we must reject the null) so that the hypothesis is
false, or the hypothesis remains consistent with available data. In the classical (Fisherian) statis-
tics perspective, this typically means that we have a single null hypothesis H0 before observing a
sample, we draw a sample X ∈ X , and then for some test statistic T : X → R with observed value
tobserved = T (X), we compute the probability under the null of observing something as extreme as
what we observed, that is, the p-value p = PH0 (T (X) ≥ tobserved ).
Yet modern data analyses are distant from this pristine perspective for many reasons. The
simplest is that we often have a number of hypotheses we wish to test, not a single one. For example,
in biological applications, we may wish to investigate the associations between the expression of
number of genes and a particular phenotype or disease; each gene j then corresponds to a null
hypothesis H0,j that gene j is independent of the phenotype. There are numerous approaches to
addressing the challenges associated with such multiple testing problems—such as false discovery
rate control, familywise error rate control, and others—with whole courses devoted to the challenges.
Even these approaches to multiple testing and high-dimensional problems do not truly capture
modern data analyses, however. Indeed, in many fields, researchers use one or a few main datasets,
writing papers and performing multiple analyses on the same dataset. For example, in medicine,
the UK Biobank dataset [163] has several thousand citations (as of 2023), many of which build
on one another, with early studies coloring the analyses in subsequent studies. Even in situations
without a shared dataset, analyses present researchers with huge degrees of freedom and choice.
A researcher may study a summary statistic of his or her sampled data, or a plot of a few simple
relationships, performing some simple data exploration—which statisticians and scientists have
advocated for 50 years, dating back at least to John Tukey!—but this means that there are huge
numbers of potential comparisons a researcher might make (that he or she does not). This “garden
of forking paths,” as Gelman and Loken [91] term it, causes challenges even when researchers are
not “p-hacking” or going on a “fishing expedition” to try to find publishable results. The problem
in these studies and approaches is that, because we make decisions that may, even only in a small
way, depend on the data observed, we have invalidated all classical statistical analyses.
To that end, we now consider interactive data analyses, where we perform data analyses se-
quentially, computing new functions on a fixed sample X1 , . . . , Xn after observing some initial

114
Lexture Notes on Statistics and Information Theory John Duchi

information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.

5.3.1 The interactive setting


We do not consider the interactive data analysis setting in full, rather, we consider a stylized
approach to the problem, as it captures many of the challenges while being broad enough for
different applications. In particular, we focus on the statistical queries setting, where a data
analyst wishes to evaluate expectations
EP [φ(X)] (5.3.1)
iid
of various functionals φ : X → R under the population P using a sample X1n ∼ P . Certainly,
numerous problems problems are solvable using statistical queries (5.3.1). Means use φ(x) = x,
while we can compute variances using the two statistical queries φ1 (x) = x and φ2 (x) = x2 , as
Var(X) = EP [φ2 (X)] − EP [φ1 (X)]2 .
Classical algorithms for the statistical query problem simply return sample means Pn φ :=
1 Pn
n i=1 φ(Xi ) given a query φ : X → R. When the number of queries to be answered is not chosen
adaptively, this means we can typically answer a large number relatively accurately; indeed, if we
have a finite collection Φ of σ 2 -sub-Gaussian φ : X → R, then we of course have
r !
2σ 2 2
P max |Pn φ − P φ| ≥ (log(2|Φ|) + t) ≤ e−t for t ≥ 0
φ∈Φ n

by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.

Example 5.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-
minimization setting (4.4.2), where the losses `(θ, Xi ) are convex and differentiable in θ. Then
gradient descent applied to Lb n (θ) = Pn `(θ, X) will converge to a minimizing value of Lb n . We
can evidently implement gradient descent by a sequence of statistical queries φ(x) = ∇θ `(θ, x),
iterating
θ(k+1 ) = θ(k) − αk Pn φ(k) , (5.3.2)
where φ(k) = ∇θ `(θ(k) , x) and αk is a stepsize. 3

One issue with the example (5.3.1) is that we are interacting with the dataset, because each
sequential query φ(k) depends on the previous k − 1 queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (5.3.2) will be well-behaved regardless of the interactivity.)
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries φ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals φ by t ∈ T for a
(possibly infinite) set T , so we have a collection {φt }t∈T . In this context, we thus have the following
scheme:

115
Lexture Notes on Statistics and Information Theory John Duchi

Input: Sample X1n drawn i.i.d. P , collection {φt }t∈T of possible queries
Repeat: for k = 1, 2, . . .

i. Analyst chooses index Tk ∈ T and query φ := φTk

ii. Mechanism responds with answer Ak approximating P φ = EP [φ(X)] using X1n

Figure 5.1: The interactive statistical query setting

Of interest in the iteration 5.1 is that we interactively choose T1 , T2 , . . . , Tk , where the choice Ti
may depend on our approximations of EP [φTj (X)] for j < i, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index Tk in alternative ways
depending on the sample X1n , and our goal is to still be able to accurately compute expectations
P φT = EP [φT (X)] when the index T may depend on X1n . The setting in Figure 5.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.

5.3.2 Second moment errors and mutual information


The starting point of our derivation is the following result, which follows from more or less identical
arguments to those for our PAC-Bayesian bounds earlier.

Theorem 5.3.2. Let {φt }t∈T be a collection of σ 2 -sub-Gaussian functions φt : X → R. Then for
any random variable T and any λ > 0,
 
2 1 n 1  2

E[(Pn φT − P φT ) ] ≤ I(X1 ; T ) − log 1 − 2λσ /n +
λ 2

and r
2σ 2
|E[Pn φT ] − E[P φT ]| ≤I(X1n ; T )
n
where the expectations are taken over T and the sample X1n .

Proof The proof is similar to that of our first basic PAC-Bayes result in Theorem 5.2.1. Let
us assume w.l.o.g. that P φt = 0 for all t ∈ T , noting that then Pn φt is σ 2 /n-sub-Gaussian. We
−1/2
prove the first result first. Lemma 5.2.2 implies that E[exp(λ(Pn φt )2 )] ≤ 1 − 2λσ 2 /n + for each


t ∈ T . As a consequence, we obtain via the Donsker-Varadhan equality (Theorem 5.1.1) that


Z  (i)  Z 
λE (Pn φt )2 dπ(t) ≤ E[Dkl (π||π0 )] + E log exp(λ(Pn φt )2 )dπ0 (t)
(ii)
Z 
2
≤ E[Dkl (π||π0 )] + log E exp(λ(Pn φt ) )dπ0 (t)
(iii) 1
log 1 − 2λσ 2 /n +
 
≤ E[Dkl (π||π0 )] −
2
for all distributions π on T , which may depend on Pn , where the expectation E is taken over the
iid
sample X1n ∼ P . (Here inequality (i) is Theorem 5.1.1, inequality (ii) is Jensen’s inequality, and

116
Lexture Notes on Statistics and Information Theory John Duchi

inequality (iii) is Lemma 5.2.2.) Now, let π0 be the marginal distribution on T (marginally over
all observations X1n ), and let π denote the posterior of T conditional on the sample X1n . Then
E[Dkl (π||π0 )] = I(X1n ; T ) by definition of the mutual information, giving the bound on the squared
error.
For the second result, note that the Donsker-Varadhan equality implies
λ2 σ 2
Z  Z
λE Pn φt dπ(t) ≤ E[Dkl (π||π0 )] + log E[exp(λPn φt )]dπ0 (t) ≤ I(X1n ; T ) + .
2n
p
Dividing both sides by λ gives E[Pn φT ] ≤ 2σ 2 I(X1n ; T )/n, and performing the same analysis with
−φT gives the second result of the theorem.

The key in the theorem is that if the mutual information—the Shannon information—I(X; T )
between the sample X and T is small, then the expected squared error can be small. To make this
n
a bit clearer, let us choose values for λ in the theorem; taking λ = 2eσ 2 gives the following corollary.

Corollary 5.3.3. Let the conditions of Theorem 5.3.2 hold. Then


2eσ 2 5σ 2
E[(Pn φT − P φT )2 ] ≤ I(X1n ; T ) + .
n 4n
Consequently, if we can limit the amount of information any particular query T (i.e., φT ) contains
about the actual sample X1n , then guarantee reasonably high accuracy in the second moment errors
(Pn φT − P φT )2 .

5.3.3 Limiting interaction in interactive analyses


Let us now return to the interactive data analysis setting of Figure 5.1, where we recall the stylized
application of estimating mean functionals P φ for φ ∈ {φt }t∈T . To motivate a more careful ap-
proach, we consider a simple example to show the challenges that may arise even with only a single
“round” of interactive data analysis. Naively answering queries accurately—using the mechanism
Pn φ that simply computes the sample average—can easily lead to problems:
Example 5.3.4 (A stylized correlation analysis): Consider the following stylized genetics
experiment. We observe vectors X ∈ {−1, 1}k , where Xj = 1 if gene j is expressed and −1
otherwise. We also observe phenotypes Y ∈ {−1, 1}, where Y = 1 indicates appearance of
the phenotype. In our setting, we will assume that the vectors X are uniform on {−1, 1}k
and independent of Y , but an experimentalist friend of ours wishes to know if there exists a
vector v with kvk2 = 1 such that the correlation between v T X and Y is high, meaning that
v T X is associated with Y . In our notation here, we have index set {v ∈ Rk | kvk2 = 1}, and
by Example 4.1.6, Hoeffding’s lemma, and the independence of the coordinates of X we have
that v T XY is kvk22 /4 = 1/4-sub-Gaussian. Now, we recall the fact that if Zj , j = 1, . . . , k, are
σ 2 -sub-Gaussian, then for any p ≥ 1, we have
E[max |Zj |p ] ≤ (Cpσ 2 log k)p/2
j

for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarith-
mically. Indeed, by Theorem 4.1.11, we have for any q ≥ 1 by Hölder’s inequality that
X 1/q
p pq
E[max |Zj | ] ≤ E |Zj | ≤ k 1/q (Cpqσ 2 )p/2 ,
j
j

117
Lexture Notes on Statistics and Information Theory John Duchi

and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v1 , . . . , vk , vk+1 ,
we have
log k
E[max(vjT (Pn Y X))2 ] ≤ O(1) .
j n
If instead we allow a single interaction, the problem is different. We issue queries associated
with v = e1 , . . . , ek , the k standard basis vectors; then we simply set Vk+1 = Pn Y X/ kPn Y Xk2 .
Then evidently
k
T
E[(Vk+1 (Pn Y X))2 ] = E[kPn Y Xk22 ] = ,
n
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed
to interact with the dataset, he or she may be able to discover very large correlations that are
certainly false in the population, which in this case has P XY = 0. 3

Example 5.3.4 shows that, without being a little careful, substantial issues may arise in interac-
tive data analysis scenarios. When we consider our goal more broadly, which is to be able to provide
accurate approximations to P φ for queries φ chosen adaptively for any population distribution P
and φ : X → [−1, 1], it is possible to construct quite perverse situations, where if we compute
sample expectations Pn φ exactly, one round of interaction is sufficient to find a query φ for which
Pn φ − P φ ≥ 1.

Example 5.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw a
iid
sample X1n of size n on a sample space X = [m] with Xi ∼ Uniform([m]), where m ≥ 2n. Let
Φ be the collection of all functions φ : [m] → [−1, 1], so that P(|Pn φ − P φ| ≥ t) ≤ exp(−nt2 /2)
for any fixed φ. Suppose that in the interactive scheme in Fig. 5.1, we simply release answers
A = Pn φ. Consider the following query:

φ(x) = n−x for x = 1, 2, . . . , m.

Then by inspection, we see that


m
X
Pn φ = n−j card({Xi | Xi = j})
j=1
1 1 1
= card({Xi | Xi = 1}) + 2 card({Xi | Xi = 1}) + · · · + m card({Xi | Xi = m}).
n n n
It is clear that given Pn φ, we can reconstruct the sample counts exactly. Then if we define a
n
second query φ2 (x) = 1 for x ∈ X1n and φ2 (x) = −1 for x 6∈ X1n , we see that P φ2 ≤ m − 1,
while Pn φ2 = 1. The gap is thus
n
E[Pn φ2 − P φ2 ] ≥ 2 − ≥ 1,
m
which is essentially as bad as possible. 3

More generally, when one performs an interactive data analysis (e.g. as in Fig. 5.1), adapting
hypotheses while interacting with a dataset, it is not a question of statistical significance or mul-
tiplicity control for the analysis one does, but for all the possible analyses one might have done
otherwise. Given the branching paths one might take in an analysis, it is clear that we require
some care.

118
Lexture Notes on Statistics and Information Theory John Duchi

With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 5.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X n → A, taking values in our index set A, where
A(X1n ) ∈ A is a random variable that depends on the sample X1n . For simplicity in derivation,
we abuse notation in this section, and for random variables X and Y with distributions P and Q
respectively, we denote
Dkl (X||Y ) := Dkl (P ||Q) .
We then ask for a type of leave-one-out stability for the algorithms A, where A is insensitive to the
changes of a single example (on average).

Definition 5.1. Let ε ≥ 0. A randomized algorithm A : X n → A is ε-KL-stable if for each


i ∈ {1, . . . , n} there is a randomized Ai : X n−1 → A such that for every sample xn1 ∈ X n ,
n
1X
Dkl A(xn1 )||Ai (x\i ) ≤ ε.

n
i=1

Examples may be useful to understand Definition 5.1.

Example 5.3.6 (KL-stability in mean estimation: Gaussian noise addition): Suppose we


wish to estimate a mean, and that P xi ∈ [−1, 1] are all real-valued. Then a natural statistic
is to simply compute A(xn1 ) = n1 ni=1 xi . In this case, without randomization, we P will have
infinite KL-divergence between A(xn1 ) and Ai (x\i ). If instead we set A(xn1 ) = n1 ni=1 xi + Z
for Z ∼ N(0, σ 2 ), and similarly Ai = n1 j6=i xj + Z, then we have (recall Example 2.1.7)
P

n n
1X n
 1 X 1 2 1
Dkl A(x1 )||A(x\i ) = x ≤ 2 2,
n 2nσ 2 n2 i 2σ n
i=1 i=1

so that a the sample mean of a bounded random variable perturbed with Guassian noise is
ε = 2σ12 n2 -KL-stable. 3

We can consider other types of noise addition as well.

Example 5.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions
of Example 2.1.7 hold, but suppose instead of Gaussian noise we add scaled Laplace noise,

119
Lexture Notes on Statistics and Information Theory John Duchi

that is, A(xn1 ) = n1 ni=1 xi + Z for Z with density p(z) = 2σ 1


P
exp(−|z|/σ), where σ > 0. Then
using that if Lµ,σ denotes the Laplace distribution with shape σ and mean µ, with density
1
p(z) = 2σ exp(−|z − µ|/σ), we have
Z |µ1 −µ0 |
1
Dkl (Lµ0 ,σ ||Lµ1 ,σ ) = 2 exp(−z/σ)(|µ1 − µ0 | − z)dz
σ 0
|µ1 − µ0 |2
 
|µ1 − µ0 | |µ1 − µ0 |
= exp − −1+ ≤ ,
σ σ 2σ 2
we see that in this case the sample mean of a bounded random variable perturbed with Laplace
noise is ε = 2σ12 n2 -KL-stable, where σ is the shape parameter. 3
The two key facts are that KL-stable algorithms compose adaptively and that they bound
mutual information in independent samples.
Lemma 5.3.8. Let A : X n → A0 and A0 : A0 × X → A1 be ε and ε0 -KL-stable algorithms,
respectively. Then the (randomized) composition A0 ◦ A(xn1 ) = A0 (A(xn1 ), xn1 ) is ε + ε0 -KL-stable.
Moreover, the pair (A0 ◦ A(xn1 ), A(xn1 )) is ε + ε0 -KL-stable.
Proof Let Ai and A0i be the promised sub-algorithms in Definition 5.1. We apply the data
processing inequality, which implies for each i that
Dkl A0 (A(xn1 ), xn1 )||A0i (Ai (x\i ), x\i ) ≤ Dkl A0 (A(xn1 ), xn1 ), A(xn1 )||A0i (Ai (x\i ), x\i ), Ai (x\i ) .
 

We require a bit of notational trickery now. Fixing i, let PA,A0 be the joint distribution of
A0 (A(xn1 ), xn1 ) and A(xn1 ) and QA,A0 the joint distribution of A0i (Ai (x\i ), x\i ) and Ai (x\i ), so that
they are both distributions over A1 × A0 . Let PA0 |a be the distribution of A0 (t, xn1 ) and similarly
QA0 |a is the distribution of A0i (t, x\i ). Note that A0 , A0i both “observe” x, so that using the chain
rule (2.1.6) for KL-divergences, we have
Dkl A0 ◦ A, A||A0i ◦ Ai , Ai = Dkl PA,A0 ||QA,A0
 
Z

= Dkl (PA ||QA ) + Dkl PA0 |t ||QA0 |t dPA (t)

= Dkl (A||Ai ) + EA [Dkl A0 (A, xn1 )||A0i (A, xn1 ) ].




Summing this from i = 1 to n yields


n n  Xn 
1X 0 0
 1X 1
Dkl A (A, x1 )||Ai (A, x1 ) ≤ ε + ε0 ,
0 n 0 n

Dkl A ◦ A||Ai ◦ Ai ≤ Dkl (A||Ai ) + EA
n n n
i=1 i=1 i=1

as desired.

The second key result is that KL-stable algorithms also bound the mutual information of a
random function.
Lemma 5.3.9. Let Xi be independent. Then for any random variable A,
Xn n Z
X
n
Dkl A(xn1 )||Ai (x\i ) dP (xn1 ),

I(A; X1 ) ≤ I(A; Xi | X\i ) =
i=1 i=1

where Ai (x\i ) = A(xi−1 n


1 , Xi , xi+1 ) is the random realization of A conditional on X\i = x\i .

120
Lexture Notes on Statistics and Information Theory John Duchi

Proof Without loss of generality, we assume A and X are both discrete. In this case, we have
Xn Xn
i−1
n
I(A; X1 ) = I(A; Xi | X1 ) = H(Xi | X1i−1 ) − H(Xi | A, X1i−1 ).
i=1 i=1

Now, because the Xi follow a product distribution, H(Xi | X1i−1 ) = H(Xi ), while H(Xi |
A, X1i−1 ) ≥ H(Xi | A, X\i ) because conditioning reduces entropy. Consequently, we have
n
X n
X
I(A; X1n ) ≤ H(Xi ) − H(Xi | A, X\i ) = I(A; Xi | X\i ).
i=1 i=1
To see the final equality, note that
Z
I(A; Xi | X\i ) = I(A; Xi | X\i = x\i )dP (x\i )
X n−1
Z Z
= Dkl (A(xn1 )||A(x1:i−1 , Xi , xi+1:n )) dP (xi )dP (x\i )
X n−1 X

by definition of mutual information as I(X; Y ) = EX [Dkl PY |X ||PY ].

Combining Lemmas 5.3.8 and 5.3.9, we see (nearly) immediately that KL stability implies
a mutual information bound, and consequently even interactive KL-stable algorithms maintain
bounds on mutual information.
Proposition 5.3.10. Let A1 , . . . , Ak be εi -KL-stable procedures, respectively, composed in any
arbitrary sequence. Let Xi be independent. Then
k
1 X
I(A1 , . . . , Ak ; X1n ) ≤ εi .
n
i=1
Proof Applying Lemma 5.3.9,
n k X
n
I(Aj ; Xi | X\i , Aj−1
X X
I(Ak1 ; X1n ) ≤ I(Ak1 ; Xi | X\i ) = 1 ).
i=1 j=1 i=1

Fix an index j and for shorthand, let A = A and A0


= (A1 , . . . , Aj−1 ) be the first j − 1 procedures.
Then expanding the final mutual information term and letting ν denote the distribution of A0 , we
have
Z
I(A; Xi | X\i , A0 ) = Dkl A(a0 , xn1 )||A(a0 , x\i ) dP (xi | A0 = a0 , x\i )dP n−1 (x\i )dν(a0 | x\i )


where A(a0 , xn1 ) is the (random) procedure A on inputs xn1 and a0 , while A(a0 , x\i ) denotes the
(random) procedure A on input a0 , x\i , Xi , and where the ith example Xi follows its disdtribution
conditional on A0 = a0 and X\i = x\i , as in Lemma 5.3.9. We then recognize that for each i, we
have
Z Z  
Dkl A(a , x1 )||A(a , x\i ) dP (xi | a , x\i ) ≤ Dkl A(a0 , xn1 )||A(a
0 n 0 0 e 0 , x\i ) dP (xi | a0 , x\i )


for any randomized function A, e as the marginal A in the lemma minimizes the average KL-
divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as
in Lemma 5.3.8.

121
Lexture Notes on Statistics and Information Theory John Duchi

5.3.4 Error bounds for a simple noise addition scheme


Based on Proposition 5.3.10, to build an appropriately well-generalizing procedure we must build
a mechanism for the interaction in Fig. 5.1 that maintains KL-stability. Using Example 5.3.6, this
is not challenging for the class of bounded queries. Let Φ = {φt }t∈T where φt : X → [−1, 1] be
the collection of statistical queries taking values in [−1, 1]. Then based on Proposition 5.3.10 and
Example 5.3.6, the following procedure is stable.

Input: Sample X1n ∈ X n drawn i.i.d. P , collection {φt }t∈T of possible queries φt : X →
[−1, 1]
Repeat: for k = 1, 2, . . .

i. Analyst chooses index Tk ∈ T and query φ := φTk

ii. Mechanism draws independent Zk ∼ N(0, σ 2 ) and responds with answer


n
1X
Ak := Pn φ + Zk = φ(Xi ) + Zk .
n
i=1

Figure 5.2: Sequential Gaussian noise mechanism.

This procedure is evidently KL-stable, and based on Example 5.3.6 and Proposition 5.3.10, we
have that
1 k
I(X1n ; T1 , . . . , Tk , Tk+1 ) ≤ 2 2
n 2σ n
so long as the indices Ti ∈ T are chosen only as functions of Pn φ + Zj for j < i, as the classical
information processing inequality implies that
1 1
I(X1n ; T1 , . . . , Tk , Tk+1 ) ≤ I(X1n ; A1 , . . . , Ak )
n n
because we have X1n → A1 → T2 and so on for the remaining indices. With this, we obtain the
following theorem.
Theorem 5.3.11. Let the indices Ti , i = 1, . . . , k + 1 be chosen in an arbitrary way using the
procedure 5.2, and let σ 2 > 0. Then
 
2 2ek 10
E max(Aj − P φTj ) ≤ 2 2 + + 4σ 2 (log k + 1).
j≤k σ n 4n
p
By inspection, we can optimize over σ 2 by setting σ 2 = k/(log k + 1)/n, which yields the
upper bound p
 
2 10 k(1 + log k)
E max(Aj − P φTj ) ≤ + 10 .
j≤k 4n n
Comparing to Example 5.3.4, we see a substantial improvement. While we do not achieve accuracy
scaling with log k, as we would if the queried functionals φt were completely independent of the
sample, we see that we achieve mean-squared error of order

k log k
n

122
Lexture Notes on Statistics and Information Theory John Duchi

for k adaptively chosen queries.


Proof To prove the result, we use a technique sometimes called the monitor technique. Roughly,
the idea is that we can choose the index Tk+1 in any way we desire as long as it is a function of the
answers A1 , . . . , Ak and any other constants independent of the data. Thus, we may choose

Tk+1 := Tk? where k ? = argmax{|Aj − P φTj |},


j≤k

as this is a (downstream) function of the k different ε = 2σ12 n2 -KL-stable queries T1 , . . . , Tk . As


a consequence, we have from Corollary 5.3.3 (and the fact that the queries φ are 1-sub-Gaussian)
that for T = Tk+1 ,
2e 5 5 ek 5
E[(Pn φT − P φT )2 ] ≤ I(X1n ; Tk+1 ) + ≤ 2ekε + = 2 2+ .
n 4n 4n σ n 4n
Now, we simply consider the independent noise addition, noting that (a + b)2 ≤ 2a2 + 2b2 for any
a, b ∈ R, so that
   
2 2 2
E max(Aj − P φTj ) ≤ 2E[(Pn φT − P φT ) ] + 2E max{Zj }
j≤k j≤k
2ek 10
≤ + + 4σ 2 (log k + 1), (5.3.3)
σ 2 n2 4n
where inequality (5.3.3) is the desired result and follows by the following lemma.
Lemma 5.3.12. Let Wj , j = 1, . . . , k be independent N(0, 1). Then E[maxj Wj2 ] ≤ 2(log k + 1).
Proof We assume that k ≥ 3, as the result is trivial otherwise. Using the tail bound for
Gaussians (Mills’s ratio for Gaussians, which is tighter Rthan the standard sub-Gaussian bound)
2 ∞
that P(W ≥ t) ≤ √2πt 1
e−t /2 for t ≥ 0 and that E[Z] = 0 P(Z ≥ t)dt for a nonnegative random
variable Z, we obtain that for any t0 ,
Z ∞ Z ∞
E[max Wj2 ] = P(max Wj2 ≥ t)dt ≤ t0 + P(max Wj2 ≥ t)dt
j 0 j t0 j
Z ∞ √ Z ∞
2k 4k
≤ t0 + 2k P(W1 ≥ t)dt ≤ t0 + √ e−t/2 dt = t0 + √ e−t0 /2 .
t0 2π t0 2π

Setting t0 = 2 log(4k/ 2π) gives E[maxj Wj2 ] ≤ 2 log k + log √42π + 1.

5.4 Bibliography and further reading


PAC-Bayes techniques originated with work of David McAllester [135, 136, 137], and we remark
on his excellently readable tutorial [138]. The particular approaches we take to our proofs in
Section 5.2 follow Catoni [44] and McAllester [137]. The PAC-Bayesian bounds we present, that
simultaneously for any distribution π on F, if F ∼ π then
 
2 n 1 1
E[(Pn F − P F ) | X1 ] . Dkl (π||π0 ) + log
n δ

123
Lexture Notes on Statistics and Information Theory John Duchi

with probability at least 1 − δ suggest that we can optimize them by choosing π carefully. For
example, in the context of learning a statistical model parameterized by θ ∈ Θ with losses `(θ; x, y),
it is natural to attempt to find π minimizing
r
1
Eπ [Pn `(θ; X, Y ) | Pn ] + C Dkl (π||π0 )
n
in π, where the expectation is taken over θ ∼ π. If this quantity has optimal value ?n ,qthen one is

immediately guaranteed that for the population P , we have Eπ [P `(θ; X, y)] ≤ n + C log 1δ / n.
?

Langford and Caruana [126] take this approach, and Dziugaite and Roy [79] use it to give (the
first) non-trivial bounds for deep learning models.
The questions of interactive data analysis begin at least several decades ago, perhaps most pro-
foundly highlighted positively by Tukey’s Exploratory Data Analysis [168]. Problems of scientific
replicability have, conversely, highlighted many of the challenges of reusing data or peeking, even
innocently, at samples before performing statistical analyses [113, 86, 91]. Our approach to for-
malizing these ideas, and making rigorous limiting information leakage, draws from a more recent
strain of work in the theoretical computer science literature, with major contributions from Dwork,
Feldman, Hardt, Pitassi, Reingold, and Roth and Bassily, Nissim, Smith, Steinke, Stemmer, and
Ullman [78, 76, 77, 20, 21]. Our particular treatment most closely follows Feldman and Steinke [82].
The problems these techniques target also arise frequently in high-dimensional statistics, where one
often wishes to estimate uncertainty and perform inference after selecting a model. While we do
not touch on these problems, a few references in this direction include [25, 166, 109].

5.5 Exercises
Exercise 5.1 (Duality in Donsker-Varadhan): Here, we give a converse result to Theorem 5.1.1,
showing that for any function h : X → R,

log EQ [eh(X) ] = sup {EP [h(X)] − Dkl (P ||Q)} , (5.5.1)


P

where the supremum is taken over probability measures. If Q has a density, the supremum may be
taken over probability measures having a density.

(a) Show the equality (5.5.1) in the case that X is discrete by directly computing the supremum.
(That is, let |X | = k, and identify probability measures P and Q with vectors p, q ∈ Rk+ .)

(b) Let Q have density q. Assume that EQ [eh(X) ] < ∞ and let

Zh (x) = exp(h(x))/EQ [exp(h(X))],

so EQ [Zh (X)] = 1. Let P have density p(x) = Zh (x)q(x). Show that

log EQ [eh(X) ] = EP [h(X)] − Dkl (P ||Q) .

Why does this imply equality (5.5.1) in this case?

(c) If EQ [eh(X) ] = +∞, then monotone convergence implies that limB↑∞ EQ [emin{B,h(X)} ] = +∞.
Conclude (5.5.1).

124
Lexture Notes on Statistics and Information Theory John Duchi

Exercise 5.2 (An alternative PAC-Bayes bound): Let f : Θ × X → R, and let π0 be a density
on θ ∈ Θ. Use the dual form (5.5.1) of the variational representation of the KL-divergence show
iid
that with probability at least 1 − δ over the draw of X1n ∼ P ,

Dkl (π||π0 ) + log 1δ


Z Z
Pn f (θ, X)π(θ)dθ ≤ log EP [exp(f (θ, X))] π(θ)dθ +
n
simultaneously for all distributions π on Θ, where the expectation EP is over X ∼ P .
Exercise 5.3 (A mean estimator with sub-Gaussian concentration for a heavy-tailed distribu-
tion [45]): In this question, we use a PAC-Bayes bound to construct an estimator of the mean E[X]
of a distribution with sub-Gaussian-like concentration that depends only on the second moments
Σ = E[XX > ] of the random vector X (not on any additional dimension-dependent quantitites)
while only assuming that E[kXk2 ] < ∞. Let ψ be an odd function (i.e., ψ(−t) = −ψ(t)) satisfying

− log(1 − t + t2 ) ≤ ψ(t) ≤ log(1 + t + t2 ).

The function ψ(t) = min{1, max{−1, t}} (the truncation of t to the range [−1, 1]) is such a function.
Let πθ be the normal distribution N(θ, σ 2 I) and π0 be N(0, σ 2 I).
(a) Let λ > 0. Use Exercise 5.2 to show that with probability at least 1 − δ, for all θ ∈ Rd

1
Z   kθk2 /2σ 2 + log 1
Pn ψ(λhθ0 , Xi)πθ (θ0 )dθ0 ≤ hθ, E[X]i + λ θ> Σθ + σ 2 tr(Σ) + 2 δ
.
λ nλ

(b) For λ > 0, define the “directional mean” estimator


Z
1
En (θ, λ) = Pn ψ(λhθ0 , Xi)πθ (θ0 )dθ0 .
λ
Give a choice of λ > 0 such that with probability 1 − δ,
s 
2 1 1  
2 tr(Σ) ,
sup |En (θ, λ) − hθ, E[X]i| ≤ √ + log kΣk op + σ
θ∈Sd−1 n 2σ 2 δ

where Sd−1 = {u ∈ Rd | kuk2 = 1} is the unit sphere.

(c) Justify the following statement: choosing the vector µ


bn minimizing

sup |En (θ, λ) − hθ, µi|


θ∈Sd−1

in µ guarantees that with probability at least 1 − δ,


s 
4 1 1  
2 tr(Σ) .
µn − E[X]k2 ≤ √
kb + log kΣkop + σ
n 2σ 2 δ

(d) Give a choice of the prior/posterior variance σ 2 so that


r
4 1
µn − E[X]k2 ≤ √
kb tr(Σ) + 2 kΣkop log
n δ
with probability at least 1 − δ.

125
Lexture Notes on Statistics and Information Theory John Duchi

Exercise 5.4 (Large-margin PAC-Bayes bounds for multiclass problems): Consider the following
multiclass prediction scenario. Data comes in pairs (x, y) ∈ bBd2 × [k] where Bd2 = {v ∈ Rd | kvk2 ≤
1} denotes the `2 -ball and [k] = {1, . . . , k}. We make predictions using predictors θ1 , . . . , θk ∈ Rd ,
where the prediction of y on an example x is

yb(x) := argmaxhθi , xi.


i≤k

We suffer an error whenever yb(x) 6= y, and the margin of our classifier on pair (x, y) is

hθy , xi − maxhθi , xi = minhθy − θi , xi.


i6=y i6=y

If hθy , xi > hθi , xi for all i 6= y, the margin is then positive (and the prediction is correct).

(a) Develop an analogue of the bounds in Section 5.2.2 in this k-class multiclass setting. To do
so, you should (i) define the analogue of the margin-based loss `γ , (ii) show how Gaussian
perturbations leave it similar, and (iii) prove an analogue of the bound in Section 5.2.2. You
should assume one of the two conditions
k
X
(C1) kθi k2 ≤ r for all i (C2) kθi k22 ≤ kr2
i=1

on your classification vectors θi . Specify which condition you choose.

(b) Describe a minimization procedure—just a few lines suffice—that uses convex optimization to
find a (reasonably) large-margin multiclass classifier.

Exercise 5.5 (A variance-based information bound): Let Φ = {φt }t∈T be a collection of functions
φt : X → R, where each φt satisfies the Bernstein condition (4.1.7) with parameters σ 2 (φt ) and b,
that is, |E[(φt (X) − P φt (X))k ]| ≤ k! 2
2 σ (φt )b
k−2 for all k ≥ 3 and Var(φ (X)) = σ 2 (φ ). Let T ∈ T
t t
be any random variable, which may depend on an observed sample X1n . Show that for all C > 0
C
and |λ| ≤ 2b , then  
Pn φT − P φT 1
E ≤ I(T ; X1n ) + |λ|.
max{C, σ(φT )} n|λ|

Exercise 5.6 (An information bound on variance): Let Φ = {φt }t∈T be a collection of functions
φt : X → R, where each φt : X → [−1, 1]. Let σ 2 (φt ) = Var(φt (X)). Let s2n (φ) = Pn φ2 − (Pn φ)2 be
the sample variance of φ. Show that for all C > 0 and 0 ≤ λ ≤ C/4, then

s2n (φT )
 
1
E ≤ I(T ; X1n ) + 2.
max{C, σ 2 (φT )} nλ

The max{C, σ 2 (φT )} term is there to help avoid division by 0. Hint: If 0 ≤ x ≤ 1, then
ex ≤ 1 + 2x, and if X ∈ [0, 1], then E[eX ] ≤ 1 + 2E[X] ≤ e2E[X] . Use this to argue that
2 2
E[eλnPn (φ−P φ) / max{C,σ } ] ≤ e2λn for any φ : X → [−1, 1] with Var(φ) ≤ σ 2 , then apply the
Donsker-Varadhan theorem.
Exercise 5.7: Consider the following scenario: let φ : X → [−1, 1] and let α > 0, τ > 0. Let
µ = Pn φ and s2 = Pn φ2 − µ2 . Define σ 2 = max{αs2 , τ 2 }, and assume that τ 2 ≥ 5α
n .

126
Lexture Notes on Statistics and Information Theory John Duchi

(a) Show that the mechanism with answer Ak defined by


A := Pn φ + Z for Z ∼ N(0, σ 2 )
is ε-KL-stable (Definition 5.1), where for a numerical constant C < ∞,
s2 α2
 
ε≤C · 2 2 · 1+ 2 .
n σ σ

(b) Show that if α2 ≤ C 0 τ 2 for a numerical constant C 0 < ∞, then we can take ε ≤ O(1) n21α .
Hint: Use exercise 2.14, and consider the “alternative” mechanisms of sampling from
2 2
N(µ−i , σ−i ) where σ−i = max{αs2−i , τ 2 }
for
1 X 1 X
µ−i = φ(Xj ) and s2−i = φ(Xj )2 − µ2−i .
n−1 n−1
j6=i j6=i

Input: Sample X1n ∈ X n drawn i.i.d. P , collection {φt }t∈T of possible queries φt : X →
[−1, 1], parameters α > 0 and τ > 0
Repeat: for k = 1, 2, . . .

i. Analyst chooses index Tk ∈ T and query φ := φTk

ii. Set s2k := Pn φ2 − (Pn φ)2 and σk2 := max{αs2k , τ 2 }

iii. Mechanism draws independent Zk ∼ N(0, σk2 ) and responds with answer
n
1X
Ak := Pn φ + Zk = φ(Xi ) + Zk .
n
i=1

Figure 5.3: Sequential Gaussian noise mechanism with variance sensitivity.

Exercise 5.8 (A general variance-dependent bound on interactive queries): Consider the algo-
rithm in Fig. 5.3. Let σ 2 (φt ) = Var(φt (X)) be the variance of φt .
(a) Show that for b > 0 and for all 0 ≤ λ ≤ 2b ,
r
|Aj − P φTj | τ2
 
1 n k
p 4α
E max ≤ I(X1 ; T1 ) + λ + 2 log(ke) 2
I(X1n ; T1k ) + 2α + 2 .
j≤k max{b, σ(φTj )} nλ nb b
(If you do not have quite the right constants, that’s fine.)
(b) Using the result of Question 5.7, show that with appropriate choices for the parameters
α, b, τ 2 , λ that for a numerical constant C < ∞
" #
|Aj − P φTj | (k log k)1/4
E max √ ≤ C √ .
j≤k max{(k log k)1/4 / n, σ(φTj )} n
You may assume that k, n are large if necessary.
(c) Interpret the result from part (b). How does this improve over Theorem 5.3.11?

127
Chapter 6

Advanced techniques in concentration


inequalities

6.1 Entropy and concentration inequalities


In the previous sections, we saw how moment generating functions and related techniques could
be used to give bounds on the probability of deviation for fairly simple quantities, such as sums of
random variables. In many situations, however, it is desirable to give guarantees for more complex
functions. As one example, suppose that we draw a matrix X ∈ Rm×n , where the entries of X are
bounded independent random variables. The operator norm of X, |||X||| := supu,v {u> Xv : kuk2 =
kvk2 = 1}, is one measure of the size of X. We would like to give upper bounds on the probability
that |||X||| ≥ E[|||X|||] + t for t ≥ 0, which the tools of the preceding sections do not address well
because of the complicated dependencies on |||X|||.
In this section, we will develop techniques to give control over such complex functions. In
particular, throughout we let Z = f (X1 , . . . , Xn ) be some function of a sample of independent
random variables Xi ; we would like to know if Z is concentrated around its mean. We will use
deep connections between information theoretic quantities and deviation probabilities to investigate
these connections.
First, we give a definition.
Definition 6.1. Let φ : R → R be a convex function. The φ-entropy of a random variable X is

Hφ (X) := E[φ(X)] − φ(E[X]), (6.1.1)

assuming the relevant expectations exist.


A first example of the φ-entropy is the variance:
Example 6.1.1 (Variance as φ-entropy): Let φ(t) = t2 . Then Hφ (X) = E[X 2 ] − E[X]2 =
Var(X). 3
This example is suggestive of the fact that φ-entropies may help us to control deviations of random
variables from their means. More generally, we have by Jensen’s inequality that Hφ (X) ≥ 0 for
any convex φ; moreover, if φ is strictly convex and X is non-constant, then Hφ (X) > 0. The
rough intuition we consider throughout this section is as follows: if a random variable X is tightly
concentrated around its mean, then we should have X ≈ E[X] “most” of the time, and so Hφ (X)
should be small. The goal of this section is to make this claim rigorous.

128
Lexture Notes on Statistics and Information Theory John Duchi

6.1.1 The Herbst argument


Perhaps unsurprisingly given the focus of these lecture notes, we focus on a specific φ, using
φ(t) = t log t, which gives the entropy on which we focus:

H(Z) := E[Z log Z] − E[Z] log E[Z], (6.1.2)

defined whenever Z ≥ 0 with probability 1. As our particular focus throughout this chapter, we
consider the moment generating function and associated transformation X 7→ eλX . If we know the
moment generating function ϕX (λ) := E[eλX ], then ϕ0X (λ) = E[XeλX ], and so

H(eλX ) = λϕ0X (λ) − ϕX (λ) log ϕX (λ).

This suggests—in a somewhat roundabout way we make precise—that control of the entropy H(eλX )
should be sufficient for controlling the moment generating function of X.
The Herbst argument makes this rigorous.

Proposition 6.1.2. Let X be a random variable and assume that there exists a constant σ 2 < ∞
such that
λ2 σ 2
H(eλX ) ≤ ϕX (λ). (6.1.3)
2
for all λ ∈ R (respectively, λ ∈ R+ ) where ϕX (λ) = E[eλX ] denotes the moment generating function
of X. Then  2 2
λ σ
E[exp(λ(X − E[X]))] ≤ exp
2
for all λ ∈ R (respectively, λ ∈ R+ ).

Proof Let ϕ = ϕX for shorthand. The proof procedes by an integration argument, where we
2 2
show that log ϕ(λ) ≤ λ 2σ . First, note that

ϕ0 (λ) = E[XeλX ],

so that inequality (6.1.3) is equivalent to

λ2 σ 2
λϕ0 (λ) − ϕ(λ) log ϕ(λ) = H(eλX ) ≤ ϕ(λ),
2
and dividing both sides by λ2 ϕ(λ) yields the equivalent statement

ϕ0 (λ) 1 σ2
− 2 log ϕ(λ) ≤ .
λϕ(λ) λ 2

But by inspection, we have

∂ 1 ϕ0 (λ) 1
log ϕ(λ) = − 2 log ϕ(λ).
∂λ λ λϕ(λ) λ

Moreover, we have that

log ϕ(λ) log ϕ(λ) − log ϕ(0) ϕ0 (0)


lim = lim = = E[X].
λ→0 λ λ→0 λ ϕ(0)

129
Lexture Notes on Statistics and Information Theory John Duchi

Integrating from 0 to any λ0 , we thus obtain


Z λ0  Z λ0 2
σ 2 λ0

1 ∂ 1 σ
log ϕ(λ0 ) − E[X] = log ϕ(λ) dλ ≤ dλ = .
λ0 0 ∂λ λ 0 2 2
Multiplying each side by λ0 gives
σ 2 λ20
log E[eλ0 (X−E[X]) ] = log E[eλ0 X ] − λ0 E[X] ≤ ,
2
as desired.

It is possible to give a similar argument for sub-exponential random variables, which allows us
to derive Bernstein-type bounds, of the form of Corollary 4.1.18, but using the entropy method. In
particular, in the exercises, we show the following result.
Proposition 6.1.3. Assume that there exist positive constants b and σ such that

H(eλX ) ≤ λ2 bϕ0X (λ) + ϕX (λ)(σ 2 − bE[X])


 
(6.1.4a)

for all λ ∈ [0, 1/b). Then X satisfies the sub-exponential bound


σ 2 λ2
log E[eλ(X−E[X]) ] ≤ (6.1.4b)
[1 − bλ]+
for all λ ≥ 0.
An immediate consequence of this proposition is that any random variable satisfying the entropy
bound (6.1.4a) is (2σ 2 , 2b)-sub-exponential. As another immediate consequence, we obtain the
concentration guarantee
  2 
1 t t
P(X ≥ E[X] + t) ≤ exp − min ,
4 σ2 b
as in Proposition 4.1.16.

6.1.2 Tensorizing the entropy


A benefit of the moment generating function approach we took in the prequel is the excellent
behavior of the moment generating function for sums. In particular, the fact that ϕX1 +···+Xn (λ) =
Q n
i=1 ϕXi (λ) allowed us to derive sharper concentration inequalities, and we were only required to
work with marginal distributions of the Xi , computing only the moment generating functions of
individual random variables rather than characteristics of the entire sum. One advantage of the
entropy-based tools we develop is that they allow similar tensorization—based on the chain rule
identities of Chapter 2 for entropy, mutual information, and KL-divergence—for substantially more
complex functions. Our approach here mirrors that of Boucheron, Lugosi, and Massart [34].
With that in mind, we now present a series of inequalities that will allow us to take this approach.
For shorthand throughout this section, we let

X\i = (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )

be the collection of all variables except Xi . Our first result is a consequence of the chain rule for
entropy and is known as Han’s inequality.

130
Lexture Notes on Statistics and Information Theory John Duchi

Proposition 6.1.4 (Han’s inequality). Let X1 , . . . , Xn be discrete random variables. Then


n
1 X
H(X1n ) ≤ H(X\i ).
n−1
i=1

Proof The proof is a consequence of the chain rule for entropy and that conditioning reduces
entropy. We have

H(X1n ) = H(Xi | X\i ) + H(X\i ) ≤ H(Xi | X1i−1 ) + H(X\i ).

Writing this inequality for each i = 1, . . . , n, we obtain


n
X n
X n
X
nH(X1n ) ≤ H(X\i ) + H(Xi | X1i−1 ) = H(X\i ) + H(X1n ),
i=1 i=1 i=1

and subtracting H(X1n ) from both sides gives the result.

We also require a divergence version of Han’s inequality, which will allow us to relate the entropy
H of a random variable to divergences and other information-theoretic quantities. Let X be an
arbitrary space, and let Q be a distribution over X n and P = P1 ×· · ·×Pn be a product distribution
on the same space. For A ⊂ X n−1 , define the marginal densities

Q(i) (A) := Q(X\i ∈ A) and P (i) (A) = P (X\i ∈ A).

We then obtain the tensorization-type Han’s inequality for relative entropies.


Proposition 6.1.5. With the above definitions,
n h
X  i
Dkl (Q||P ) ≤ Dkl (Q||P ) − Dkl Q(i) ||P (i) .
i=1

Proof We have seen earlier in the notes (recall the definition (2.2.1) of the KL divergence as
a supremum over all quantizers and the surrounding discussion) that it is no loss of generality to
assume that X is discrete. Thus, noting that the probability mass functions
X Y
q (i) (x\i ) = q(xi−1 n (i)
1 , x, xi+1 ) and p (x\i ) = pj (xj ),
x j6=i

we have that Han’s inequality (Proposition 6.1.4) is equivalent to


X n X
X
(n − 1) q(xn1 ) log q(xn1 ) ≥ q (i) (x\i ) log q (i) (x\i ).
xn
1 i=1 x\i

Now, by subtracting q(xn1 ) log p(xn1 ) from both sides of the preceding display, we obtain
X X
(n − 1)Dkl (Q||P ) = (n − 1) q(xn1 ) log q(xn1 ) − (n − 1) q(xn1 ) log p(xn1 )
xn
1 xn
1
n X
X X
≥ q (i) (x\i ) log q (i) (x\i ) − (n − 1) q(xn1 ) log p(xn1 ).
i=1 x\i xn
1

131
Lexture Notes on Statistics and Information Theory John Duchi

We expand the final term. Indeed, by the product nature of the distributions p, we have

X X n
X
(n − 1) q(xn1 ) log p(xn1 ) = (n − 1) q(xn1 ) log pi (xi )
xn
1 xn
1 i=1
n X
X X n X
X
= q(xn1 ) log pi (xi ) = q (i) (x\i ) log p(i) (x\i ).
i=1 xn
1 j6=i i=1 x\i
| {z }
=log p(i) (x\i )

Noting that
X X  
q (i) (x\i ) log q (i) (x\i ) − q (i) (x\i ) log p(i) (x\i ) = Dkl Q(i) ||P (i)
x\i x\i

and rearranging gives the desired result.

Finally, we will prove the main result of this subsection: a tensorization identity for the entropy
H(Y ) for an arbitrary random variable Y that is a function of n independent random variables.
For this result, we use a technique known as tilting, in combination with the two variants of Han’s
inequality we have shown, to obtain the result. The tilting technique is one used to transform
problems of random variables into one of distributions, allowing us to bring the tools of information
and entropy to bear more directly. This technique is a common one, and used frequently in
large deviation theory, statistics, for heavy-tailed data, amont other areas. More concretely, let
Y = f (X1 , . . . , Xn ) for some non-negative function f . Then we may always define a tilted density

f (x1 , . . . , xn )p(x1 , . . . , xn )
q(x1 , . . . , xn ) := (6.1.5)
EP [f (X1 , . . . , Xn )]

which, by inspection, satisfies q(xn1 ) = 1 and q ≥ 0. In our context, if f ≈ constant under the
R

distribution P , then we should have f (xn1 )p(xn1 ) ≈ cp(xn1 ) and so Dkl (Q||P ) should be small; we
can make this rigorous via the following tensorization theorem.

Theorem 6.1.6. Let X1 , . . . , Xn be independent random variables and Y = f (X1n ), where f is a


non-negative function. Define H(Y | X\i ) = E[Y log Y | X\i ]. Then
n
X 
H(Y ) ≤ E H(Y | X\i ) . (6.1.6)
i=1

Proof Inequality (6.1.6) holds for Y if and only if holds identically for cY for any c > 0, so
we assume without loss of generality that EP [Y ] = 1. We thus obtain that H(Y ) = E[Y log Y ] =
E[φ(Y )], where assign φ(t) = t log t. Let P have density p with respect to a base measure µ. Then
by defining the tilted distribution (density) q(xn1 ) = f (xn1 )p(xn1 ), we have Q(X n ) = 1, and moreover,
we have
q(xn1 )
Z Z
n
Dkl (Q||P ) = q(x1 ) log dµ(x1 ) = f (xn1 )p(xn1 ) log f (xn1 )dµ(xn1 ) = EP [Y log Y ] = H(Y ).
n
p(xn1 )

132
Lexture Notes on Statistics and Information Theory John Duchi

Similarly, if φ(t) = t log t, then


 
(i) (i)
Dkl Q ||P
p(i) (x\i ) f (x1i−1 , x, xni+1 )pi (x)dµ(x) (i)
Z Z  R
i−1 n
= f (x1 , x, xi+1 )pi (x)dµ(x) log p (x\i )dµ(x\i )
X n−1 p(i) (x\i )
Z
= E[Y | x\i ] log E[Y | x\i ]p(i) (x\i )dµ(x\i )
X n−1
= E[φ(E[Y | X\i ])].

The tower property of expectations then yields that

E[φ(Y )] − E[φ(E[Y | X\i ])] = E[E[φ(Y ) | X\i ] − φ(E[Y | X\i ])] = E[H(Y | X\i )].

Using Han’s inequality for relative entropies (Proposition 6.1.4) then immediately gives
n h
X  n
i X
H(Y ) = Dkl (Q||P ) ≤ Dkl (Q||P ) − Dkl Q(i) ||P (i) = E[H(Y | X\i )],
i=1 i=1

which is our desired result.

Theorem 6.1.6 shows that if we can show that individually the conditional entropies H(Y | X\i )
are not too large, then the Herbst argument (Proposition 6.1.2 or its variant Proposition 6.1.3)
allows us to provide strong concentration inequalities for general random variables Y .

Examples and consequences


We now show how to use some of the preceding results to derive strong concentration inequalities,
showing as well how we may give convergence guarantees for a variety of procedures using these
techniques.
We begin with our most straightforward example, which is the bounded differences inequality.
In particular, we consider an arbitrary function f of n independent random variables, and we
assume that for all x1:n = (x1 , . . . , xn ), we have the bounded differences condition:

sup f (x1 , . . . , xi−1 , x, xi+1 , . . . , xn ) − f (x1 , . . . , xi−1 , x0 , xi+1 , . . . , xn ) ≤ ci for all x\i .
x∈X ,x0 ∈X
(6.1.7)
Then we have the following result.

Proposition 6.1.7 (Bounded differences). Assume that f satisfies the bounded differences condi-
1 Pn
tion (6.1.7), where 4 i=1 ci ≤ σ 2 . Let Xi be independent. Then Y = f (X1 , . . . , Xn ) is σ 2 -sub-
2

Gaussian.

Proof We use a similar integration argument to the Herbst argument of Proposition 6.1.2, and
we apply the tensorization inequality (6.1.6). First, let U be an arbitrary random variable taking
values in [a, b]. We claim that if ϕU (λ) = E[eλU ] and ψ(λ) = log ϕU (λ) is its cumulant generating
function, then
H(eλU ) λ2 (b − a)2
≤ . (6.1.8)
E[eλU ] 8

133
Lexture Notes on Statistics and Information Theory John Duchi

To see this, note that


λ
λ2 (b − a)2
Z

[λψ 0 (λ) − ψ(λ)] = ψ 00 (λ), so λψ 0 (λ) − ψ(λ) = tψ 00 (t)dt ≤ ,
∂λ 0 8
where we have used the homework exercise XXXX (recall Hoeffding’s Lemma, Example 4.1.6), to
2
argue that ψ 00 (t) ≤ (b−a)
4 for all t. Recalling that

H(eλU ) = λϕ0U (λ) − ϕU (λ)ψ(λ) = λψ 0 (λ) − ψ(λ) ϕU (λ)


 

gives inequality (6.1.8).


Now we apply the tensorization identity. Let Z = eλY . Then we have
n n n
c2i λ2 c2i λ2
X  X  X
λZ
H(Z) ≤ E H(Z | X\i ) ≤ E E[e | X\i ] = E[eλZ ].
8 8
i=1 i=1 i=1

Applying the Herbst argument gives the final result.

As an immediate consequence of this inequality, we obtain the following dimension independent


concentration inequality.
Example 6.1.8: Let X1 , . . . , Xn be independent vectors in Rd , where d is arbitrary, and
assume that kXi k2 ≤ ci with probability 1. (This could be taken to be a general Hilbert space
with no loss of generality.) We claim that if we define
n n √ 2!
[t − σ]+
X  X 
σ 2 := c2i , then P Xi ≥ t ≤ exp −2 2
.
2 σ
i=1 i=1

Indeed, we have that Y = k ni=1 Xi k2 satisfies the bounded differences inequality with param-
P
eters ci , and so
 X n   X n Xn X n 
P Xi ≥ t = P Xi − E Xi ≥ t − E Xi
i=1 2 i=1 2 i=1 2 i=1 2

Ek i=1 Xi k2 ]2+
Pn !
[t −
≤ exp −2 Pn 2 .
i=1 ci

Pn q P qP
n n 2
i=1 E[kXi k2 ] gives the result. 3
2
Noting that E[k i=1 Xi k2 ] ≤ E[k i=1 Xi k2 ] =

6.1.3 Concentration of convex functions


We provide a second theorem on the concentration properties of a family of functions that are quite
useful, for which other concentration techniques do not appear to give results. In particular, we
say that a function f : Rn → R is separately convex if for each i ∈ {1, . . . , n} and all x\i ∈ Rn−1
(or the domain of f ), we have that
x 7→ f (x1 , . . . , xi−1 , x, xi+1 , . . . , xn )
is convex. We also recall that a function is L-Lipschitz if |f (x) − f (y)| ≤ kx − yk2 for all x, y ∈
Rn ; any L-Lipschitz function is almost everywhere differentiable, and is L-Lipschitz if and only if
k∇f (x)k2 ≤ L for (almost) all x. With these preliminaries in place, we have the following result.

134
Lexture Notes on Statistics and Information Theory John Duchi

Theorem 6.1.9. Let X1 , . . . , Xn be independent random variables with Xi ∈ [a, b] for all i. Assume
that f : Rn → R is separately convex and L-Lipschitz with respect to the k·k2 norm. Then

E[exp(λ(f (X1:n ) − E[f (X1:n )]))] ≤ exp λ2 (b − a)2 L2 for all λ ≥ 0.




We defer the proof of the theorem temporarily, giving two example applications. The first is to
the matrix concentration problem that motivates the beginning of this section.

Example 6.1.10: Let X ∈ Rm×n be a matrix with independent entries, where Xij ∈ [−1, 1]
for all i, j, and let |||·||| denote the operator norm on matrices, that is, |||A||| = supu,v {u> Av :
kuk2 ≤ 1, kvk2 ≤ 1}. Then Theorem 6.1.9 implies
 2
t
P(|||X||| ≥ E[|||X|||] + t) ≤ exp −
16

for all t ≥ 0. Indeed, we first observe that

| |||X||| − |||Y ||| | ≤ |||X − Y ||| ≤ kX − Y kFr ,

where k·kFr denotes the Frobenius norm of a matrix. Thus the matrix operator norm is 1-
Lipschitz. Therefore, we have by Theorem 6.1.9 and the Chernoff bound technique that

P(|||X||| ≥ E[|||X|||] + t) ≤ exp(4λ2 − λt)

for all λ ≥ 0. Taking λ = t/8 gives the desired result. 3

As a second example, we consider Rademacher complexity. These types of results are important
for giving generalization bounds in a variety of statistical algorithms, and form the basis of a variety
of concentration and convergence results. We defer further motivation of these ideas to subsequent
chapters, just mentioning here that we can provide strong concentration guarantees for Rademacher
complexity or Rademacher chaos.

Example 6.1.11: Let A ⊂ Rn be any collection of vectors. The the Rademacher complexity
of the class A is
n
" #
X
Rn (A) := E sup a i εi , (6.1.9)
a∈A i=1

bn (A) = supa∈A Pn ai εi denote the


where εi are i.i.d. Rademacher (sign) variables. Let R i=1
empirical version of this quantity. We claim that

t2
 
P(Rn (A) ≥ Rn (A) + t) ≤ exp −
b ,
16 diam(A)2

where diam(A) := supa∈A kak2 . Indeed, we have that ε 7→ supa∈A a> ε is a convex function,
as it is the maximum of a family of linear functions. Moreover, it is Lipschitz, with Lipschitz
constant bounded by supa∈A kak2 . Applying Theorem 6.1.9 as in Example 6.1.10 gives the
result. 3

Proof of Theorem 6.1.9 The proof relies on our earlier tensorization identity and a sym-
metrization lemma.

135
Lexture Notes on Statistics and Information Theory John Duchi

iid
Lemma 6.1.12. Let X, Y ∼ P be independent. Then for any function g : R → R, we have

H(eλg(X) ) ≤ λ2 E[(g(X) − g(Y ))2 eλg(X) 1 {g(X) ≥ g(Y )}] for λ ≥ 0.

Moreover, if g is convex, then

H(eλg(X) ) ≤ λ2 E[(X − Y )2 (g 0 (X))2 eλg(X) ] for λ ≥ 0.

Proof For the first result, we use the convexity of the exponential in an essential way. In
particular, we have

H(eλg(X) ) = E[λg(X)eλg(X) ] − E[eλg(X) ] log E[eλg(Y ) ]


≤ E[λg(X)eλg(X) ] − E[eλg(X) λg(Y )],

because log is concave and ex ≥ 0. Using symmetry, that is, that g(X) − g(Y ) has the same
distribution as g(Y ) − g(X), we then find
1
H(eλg(X) ) ≤ E[λ(g(X)−g(Y ))(eλg(X) −eλg(Y ) )] = E[λ(g(X)−g(Y ))(eλg(X) −eλg(Y ) )1 {g(X) ≥ g(Y )}].
2
Now we use the classical first order convexity inequality—that a convex function f satisfies f (t) ≥
f (s)+f 0 (s)(t−s) for all t and s, Theorem B.3.3 in the appendices—which gives that et ≥ es +es (t−s)
for all s and t. Rewriting, we have es − et ≤ es (s − t), and whenever s ≥ t, we have (s − t)(es − et ) ≤
es (s − t)2 . Replacing s and t with λg(X) and λg(Y ), respectively, we obtain

λ(g(X) − g(Y ))(eλg(X) − eλg(Y ) )1 {g(X) ≥ g(Y )} ≤ λ2 (g(X) − g(Y ))2 eλg(X) 1 {g(X) ≥ g(Y )} .

This gives the first inequality of the lemma.


To obtain the second inequality, note that if g is convex, then whenever g(x) − g(y) ≥ 0, we
have g(y) ≥ g(x) + g 0 (x)(y − x), or g 0 (x)(x − y) ≥ g(x) − g(y) ≥ 0. In particular,

(g(X) − g(Y ))2 1 {g(X) ≥ g(Y )} ≤ (g 0 (X)(X − Y ))2 ,

which gives the second result.

Returning to the main thread of the proof, we note that the separate convexity of f and the
tensorization identity of Theorem 6.1.6 imply
n n
X  X "  2 #
λf (X1:n ) λf (X1:n ) 2 2 ∂ λf (X1:n )
H(e )≤E H(e | X\i ) ≤ E λ E (Xi − Yi ) f (X1:n ) e | X\i ,
∂xi
i=1 i=1

where Yi are independent copies of the Xi . Now, we use that (Xi −Yi )2 ≤ (b−a)2 and the definition
of the partial derivative to obtain

H(eλf (X1:n ) ) ≤ λ2 (b − a)2 E[k∇f (X1:n )k22 eλf (X1:n )) ].

Noting that k∇f (X)k22 ≤ L2 , and applying the Herbst argument, gives the result.

136
Lexture Notes on Statistics and Information Theory John Duchi

Exercise 6.1 (A discrete isoperimetric inequality): Let A ⊂ Zd be a finite subset of the d-


dimensional integers. Let the projection mapping πj : Z → Zd−1 be defined by
d

πj (z1 , . . . , zd ) = (z1 , . . . , zj−1 , zj+1 , . . . , zd )

so that we “project out” the jth coordinate, and define the projected sets.

Aj = πj (A) = {πj (z) : z ∈ A}


n o
= z ∈ Zd−1 : there exists z? ∈ Z such that (z1 , z2 , . . . , zj−1 , z? , zj , . . . , zd−1 ) ∈ A .

Prove the Loomis-Whitney inequality, that is, that


  1
d d−1
Y
card(A) ≤  card(Aj ) .
j=1

137
Chapter 7

Privacy and disclosure limitation

In this chapter, we continue to build on our ideas on stability in different scenarios, ranging from
model fitting and concentration to interactive data analyses. Here, we show how stability ideas
allow us to provide a new type of protection: the privacy of participants in studies. Until the mid-
2000s, the major challenge in this direction had been a satisfactory definition of privacy, because
collection of side information often results in unforeseen compromises of private information. The
introduction of differential privacy—a type of stability in likelihood ratios for data releases from
differing samples—alleviated these challenges, providing a firm foundation on which to build private
estimators and other methodology. (Though it is possible to trace some of the definitions and major
insights in privacy back at least to survey sampling literature in the 1960s.) Consequently, in this
chapter we focus on privacy notions based on differential privacy and its cousins, developing the
information-theoretic stability ideas helpful to understand the protections it is possible to provide.

7.1 Disclosure limitation, privacy, and definitions


We begin this chapter with a few cautionary tales and examples, which motivate the coming
definitions of privacy that we consider. A natural belief might be that, given only certain summary
statistics of a large dataset, individuals in the data are protected. Yet this appears, by and large,
to be false. As an example, in 2008 Nils Homer and colleagues [107] showed that even releasing
aggregated genetic frequency statistics (e.g., frequency of single nucleotide polymorphisms (SNP) in
microarrays) can allow resolution of individuals within a database. Consequently, the US National
Institutes of Health (NIH), the Wellcome Trust, and the Broad Institute removed genetic summaries
from public access (along with imposing stricter requirements for private access) [161, 52].
Another hypothetical example may elucidate some of the additional challenges. Suppose that I
release a dataset that consists of the frequent times that posts are made worldwide that denigrate
government policies, but I am sure to remove all information such as IP addresses, usernames, or
other metadata excepting the time of the post. This might seem a priori reasonably safe, but now
suppose that an authoritarian government knows precisely when its citizens are online. Then by
linking the two datasets, the government may be able to track those who post derogatory statements
about their leaders.
Perhaps the strongest definition of privacy of databases and datasets is due to Dalenius [56], who
suggests that “nothing about an individual should be learnable from the database that cannot be
learned without access to the database.” But quickly, one can see that it is essentially impossible
to reconcile this idea with scientific advancement. Consider, for example, a situation where we

138
Lexture Notes on Statistics and Information Theory John Duchi

perform a study on smoking, and discover that smoking causes cancer. We publish the result, but
now we have “compromised” the privacy of everyone who smokes who did not participate in the
study: we know they are more likely to get cancer.
In each of these cases, the biggest challenge is one of side information: how can we be sure
that, when releasing a particular statistic, dataset, or other quantity that no adversary will be able
to infer sensitive data about participants in our study? We articulate three desiderata that—we
believe—suffice for satisfactory definitions of privacy. In discussion of private releases of data, we
require a bit of vocabulary. We term a (randomized) algorithm releasing data either a privacy
mechanism, consistent with much of the literature in privacy, or a channel, mapping from the input
sample to some output space, in keeping with our statistical and information-theoretic focus. In
no particular order, we wish our privacy mechanism, which takes as input a sample X1n ∈ X n and
releases some Z to satisfy the following.

i. Given the output Z, even an adversary knowing everyone in the study (excepting one person)
should not be able to test whether you belong to the study.

ii. If you participate in multiple “private” studies, there should be some graceful degradation
in the privacy protections, rather than a catastrophic failure. As part of this, any definition
should guarantee that further processing of the output Z of a private mechanism X1n → Z, in
the form of the Markov chain X1n → Z → Y , should not allow further compromise of privacy
(that is, a data-processing inequality). Additional participation in “private” studies should
continue to provide little additional information.

iii. The mechanism X1n → Z should be resilient to side information: even if someone knows
something about you, he should learn little about you if you belong to X1n , and this should
remain true even if the adversary later gleans more information about you.

The third desideratum is perhaps most elegantly phrased via a Bayesian perspective, where an
adversary has some prior beliefs π on the membership of a dataset (these prior beliefs can then
capture any side information the adversary has). The strongest adversary has a prior supported on
two samples {x1 , . . . , xn } and {x01 , . . . , x0n } differing in only a single element; a private mechanism
would then guarantee the adversary’s posterior beliefs (after the release X1n → Z) should not change
significantly.
Before continuing addressing these challenges, we take a brief detour to establish notation for the
remainder of the chapter. It will be convenient to consider randomized procedures acting on samples
n 1 Pn
themselves; a sample x1 is cleary isomorphic to the empirical distribution Pn = n i=1 1xi , and
for two empirical distributions Pn and Pn0 supported on {x1 , . . . , xn } and {x01 , . . . , x0n }, we evidently
have
n Pn − Pn0 TV = dham ({x1 , . . . , xn }, {x01 , . . . , x0n }),
and so we will identify samples with their empirical distributions. With this notational convenience
in place, we then identify
n
( )
1X
Pn = Pn = 1xi | xi ∈ X
n
i=1

as the set of all empirical distributions on n points in X and we also abuse notation in an obvious
way to define dham (Pn , Pn0 ) := n kPn − Pn0 kTV as the number of differing observations in the samples
Pn and Pn0 represent. A mechanism M is then a (typically) randomized mapping M : Pn → Z,

139
Lexture Notes on Statistics and Information Theory John Duchi

which we can identify with its induced Markov channel Q from X n → Z; we use the equivalent
views as is convenient.
The challenges of side information motivate Dwork et al.’s definition of differential privacy [74].
The key in differential privacy is that the noisy channel releasing statistics provides guarantees of
bounded likelihood ratios between neighboring samples, that is, samples differing in only a single
entry.
Definition 7.1 (Differential privacy). Let M : Pn → Z be a randomized mapping. Then M is
ε-differentially private if for all (measurable) sets S ⊂ Z and all Pn , Pn0 ∈ Pn with dham (Pn , Pn0 ) ≤ 1,

P(M (Pn ) ∈ S)
≤ eε . (7.1.1)
P(M (Pn0 ) ∈ S)
The intuition and original motivation for this definition are that an individual has little incentive
to participate (or not participate) in a study, as the individual’s data has limited effect on the
outcome.
The model (7.1.1) of differential privacy presumes that there is a trusted curator, such as a
hospital, researcher, or corporation, who can collect all the data into one centralized location, and
it is consequently known as the centralized model. A stronger model of privacy is the local model,
in which data providers trust no one, not even the data collector, and privatize their individual
data before the collector even sees it.
Definition 7.2 (Local differential privacy). A channel Q from X to Z is ε-locally differentially
private if for all measurable S ⊂ Z and all x, x0 ∈ X ,
Q(Z ∈ S | x)
≤ eε . (7.1.2)
Q(Z ∈ S | x0 )

It is clear that Definition 7.2 and the condition (7.1.2) are stronger than Definition 7.1: when
samples {x1 , . . . , xn } and {x01 , . . . , x0n } differ in at most one observation, then the local model (7.1.2)
guarantees that the densities
n
dQ(Z1n | {xi }) Y dQ(Zi | xi )
= ≤ eε ,
dQ(Z1n | {x0i }) dQ(Zi | x0i )
i=1

where the inequality follows because only a single ratio may contain xi 6= x0i .
In the remainder of this introductory section, we provide a few of the basic mechanisms in use
in differential privacy, then discuss its “semantics,” that is, its connections to the three desiderata
we outline above. In the coming sections, we revisit a few more advanced topics, in particular, the
composition of multiple private mechanisms and a few weakenings of differential privacy, as well as
more sophisticated examples.

7.1.1 Basic mechanisms


The basic mechanisms in either the local or centralized models of differential privacy use some type
of noise addition to ensure privacy. We begin with the simplest and oldest mechanism, randomized
response, for local privacy, due to Warner [173] in 1965.

Example 7.1.1 (Randomized response): We wish to have a participant in a study answer


a yes/no question about a sensitive topic (for example, drug use). That is, we would like to

140
Lexture Notes on Statistics and Information Theory John Duchi

estimate the proportion of the population with a characteristic (versus those without); call
these groups 0 and 1. Rather than ask the participant to answer the question specifically,
however, we give them a spinner with a face painted in two known areas, where the first
corresponds to group 0 and has area eε /(1 + eε ) and the second to group 1 and has area
1/(1 + eε ). Thus, when the participant spins the spinner, it lands in group 0 with probability
eε /(1 + eε ). Then we simply ask the participant, upon spinning the spinner, to answer “Yes”
if he or she belongs to the indicated group, “No” otherwise.
Let us demonstrate that this randomized response mechanism provides ε-local differential
privacy. Indeed, we have
\[
\frac{Q(\text{Yes} \mid x = 0)}{Q(\text{Yes} \mid x = 1)} = e^{\varepsilon}
\quad \text{and} \quad
\frac{Q(\text{No} \mid x = 0)}{Q(\text{No} \mid x = 1)} = e^{-\varepsilon},
\]
so that Q(Z = z | x)/Q(Z = z | x′) ∈ [e−ε, eε] for all x, x′, z. That is, the randomized response
channel provides ε-local privacy. ⋄

The interesting question is, of course, whether we can still use this channel to estimate the
proportion of the population with the sensitive characteristic. Indeed, we can. We now give a
somewhat more general analysis, so as to provide a complete worked example.
Example 7.1.2 (Randomized response, continued): Suppose that we have an attribute of
interest, x, taking the values x ∈ {1, . . . , k}. Then we consider the channel (of Z drawn
conditional on x)
\[
Z = \begin{cases}
x & \text{with probability } \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}} \\
\mathrm{Uniform}([k] \setminus \{x\}) & \text{with probability } \frac{k - 1}{k - 1 + e^{\varepsilon}}.
\end{cases}
\]
This (generalized) randomized response mechanism is evidently ε-locally private, satisfying
Definition 7.2.
Let $p \in \mathbb{R}_+^k$, $p^T \mathbf{1} = 1$, indicate the true probabilities $p_i = P(X = i)$. Then by inspection, we
have
\[
P(Z = i) = p_i \frac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}} + (1 - p_i) \frac{1}{k - 1 + e^{\varepsilon}}
= p_i \frac{e^{\varepsilon} - 1}{e^{\varepsilon} + k - 1} + \frac{1}{e^{\varepsilon} + k - 1}.
\]
Thus, letting $\hat{c}_n \in \mathbb{R}_+^k$ denote the empirical proportions of the Z observations in a sample of
size n, we have that
\[
\hat{p}_n := \frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1} \left( \hat{c}_n - \frac{1}{e^{\varepsilon} + k - 1} \mathbf{1} \right)
\]
satisfies $E[\hat{p}_n] = p$, and we also have
\[
E\big[\|\hat{p}_n - p\|_2^2\big]
= \left(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\right)^2 E\big[\|\hat{c}_n - E[\hat{c}_n]\|_2^2\big]
= \frac{1}{n} \left(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\right)^2 \sum_{j=1}^k P(Z = j)\big(1 - P(Z = j)\big).
\]
As $\sum_j P(Z = j) = 1$, we always have the bound $E[\|\hat{p}_n - p\|_2^2] \le \frac{1}{n} \big(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\big)^2$.

We may consider two regimes for simplicity: when ε ≤ 1 and when ε ≥ log k. In the former
case—the high privacy regime—we have $P(Z = i) \asymp \frac{1}{k}$, so that the mean squared ℓ2 error
scales as $\frac{1}{n} \frac{k^2}{\varepsilon^2}$. When ε ≥ log k is large, by contrast, we see that the error scales at worst as $\frac{1}{n}$,
which is the “non-private” mean squared error. ⋄
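The debiasing step in Example 7.1.2 is mechanical enough that a short simulation may help. The
following Python sketch (our own illustration, not from the notes; it assumes only numpy, and all
function names are hypothetical) implements the k-ary randomized response channel and the
unbiased estimator p̂n above.

    import numpy as np

    def randomized_response(x, k, eps, rng):
        """k-ary randomized response: report the true value x in {0, ..., k-1}
        with probability e^eps / (k - 1 + e^eps), else a uniform other value."""
        if rng.random() < np.exp(eps) / (k - 1 + np.exp(eps)):
            return x
        z = rng.integers(k - 1)
        return z if z < x else z + 1  # uniform over [k] \ {x}

    def estimate_proportions(zs, k, eps):
        """Debiased estimate of p from the privatized reports, per Example 7.1.2."""
        c_hat = np.bincount(zs, minlength=k) / len(zs)  # empirical proportions of Z
        return (np.exp(eps) + k - 1) / (np.exp(eps) - 1) * (c_hat - 1 / (np.exp(eps) + k - 1))

    rng = np.random.default_rng(0)
    k, eps, n = 10, 1.0, 100_000
    xs = rng.choice(k, size=n)  # uniform true attributes, p_i = 1/k
    zs = np.array([randomized_response(x, k, eps, rng) for x in xs])
    print(estimate_proportions(zs, k, eps))  # close to 1/k in each coordinate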


While randomized response is essentially the standard mechanism in locally private settings, in
centralized privacy, the “standard” mechanism is Laplace noise addition because of its exponential
tails. In this case, we require a few additional definitions. Suppose that we wish to release some
d-dimensional function f(Pn) of the sample distribution Pn (equivalently, the associated sample
X1n), where f takes values in Rd. In the case that f is Lipschitz with respect to the Hamming
metric—that is, the counting metric on X n—it is relatively straightforward to develop private
mechanisms. To better reflect the nomenclature of the privacy literature and to ease our future
development, for p ∈ [1, ∞] we define the global sensitivity of f by
\[
\mathrm{GS}_p(f) := \sup_{P_n, P_n' \in \mathcal{P}_n} \left\{ \left\| f(P_n) - f(P_n') \right\|_p \mid d_{\mathrm{ham}}(P_n, P_n') \le 1 \right\}.
\]
This is simply the Lipschitz constant of f with respect to the Hamming metric. The global sensi-
tivity is a convenient quantity, because it allows simple noise addition strategies.

Example 7.1.3 (Laplace mechanisms): Recall the Laplace distribution, parameterized by a
shape parameter β, which has density on R defined by
\[
p(w) = \frac{1}{2\beta} \exp(-|w|/\beta),
\]
and the analogous d-dimensional variant, which has density
\[
p(w) = \frac{1}{(2\beta)^d} \exp(- \|w\|_1 / \beta).
\]
If W ∼ Laplace(β), W ∈ R, then E[W] = 0 by symmetry, while $E[W^2] = \frac{1}{\beta} \int_0^\infty w^2 e^{-w/\beta} dw = 2\beta^2$.
Suppose that f : Pn → Rd has finite global sensitivity for the ℓ1-norm,
\[
\mathrm{GS}_1(f) = \sup \left\{ \left\| f(P_n) - f(P_n') \right\|_1 \mid d_{\mathrm{ham}}(P_n, P_n') \le 1, \; P_n, P_n' \in \mathcal{P}_n \right\}.
\]

Letting L = GS1(f) be the Lipschitz constant for simplicity, if we consider the mechanism
defined by the addition of W ∈ Rd with independent Laplace(L/ε) coordinates,
\[
Z := f(P_n) + W, \qquad W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}(L/\varepsilon),
\tag{7.1.3}
\]
we have that Z is ε-differentially private. Indeed, for samples Pn, Pn′ differing in at most a
single example, Z has density ratio
\[
\frac{q(z \mid P_n)}{q(z \mid P_n')}
= \exp\left( -\frac{\varepsilon}{L} \left\| f(P_n) - z \right\|_1 + \frac{\varepsilon}{L} \left\| f(P_n') - z \right\|_1 \right)
\le \exp\left( \frac{\varepsilon}{L} \left\| f(P_n) - f(P_n') \right\|_1 \right)
\le \exp(\varepsilon)
\]
by the triangle inequality and that f is L-Lipschitz with respect to the Hamming metric. Thus
Z is ε-differentially private. Moreover, we have

\[
E\big[\|Z - f(P_n)\|_2^2\big] = \frac{2 d \, \mathrm{GS}_1(f)^2}{\varepsilon^2},
\]
so that if L is small, we may report the value of f accurately. ⋄

The most common instances and applications of the Laplace mechanism are in estimation of
means and histograms. Let us demonstrate more carefully worked examples in these two cases.


Example 7.1.4 (Private one-dimensional mean estimation): Suppose that we have variables
Xi taking values in [−b, b] for some b < ∞, and wish to estimate E[X]. A natural function to
release is then $f(X_1^n) = \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. This has Lipschitz constant 2b/n with respect to
the Hamming metric, because for any two samples x, x′ ∈ [−b, b]n differing in only entry i, we
have
\[
|f(x) - f(x')| = \frac{1}{n} |x_i - x_i'| \le \frac{2b}{n}
\]
because xi, x′i ∈ [−b, b]. Thus the Laplace mechanism (7.1.3) with noise W ∼ Laplace(2b/(nε))
yields
\[
E[(Z - E[X])^2] = E[(\overline{X}_n - E[X])^2] + E[(Z - \overline{X}_n)^2]
= \frac{\mathrm{Var}(X)}{n} + \frac{8b^2}{n^2\varepsilon^2} \le \frac{b^2}{n} + \frac{8b^2}{n^2\varepsilon^2}.
\]
We can privately release means with little penalty so long as ε ≫ n−1/2. ⋄
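As a quick illustration (a minimal sketch of our own, assuming numpy; the function name is
hypothetical), the Laplace mechanism (7.1.3) for this bounded-mean release is a few lines:

    import numpy as np

    def private_mean(xs, b, eps, rng):
        """Release the mean of xs, with entries in [-b, b], under eps-differential
        privacy via the Laplace mechanism (7.1.3); GS_1 of the mean is 2b/n."""
        n = len(xs)
        sensitivity = 2 * b / n  # Lipschitz constant of the sample mean on [-b, b]^n
        return np.mean(xs) + rng.laplace(scale=sensitivity / eps)

    rng = np.random.default_rng(1)
    xs = rng.uniform(-1, 1, size=10_000)  # b = 1
    print(private_mean(xs, b=1.0, eps=0.5, rng=rng))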

Example 7.1.5 (Private histogram (multinomial) release): Suppose that we wish to estimate
a multinomial distribution, or put differently, a histogram. That is, we have observations
X ∈ {1, . . . , k}, where k may be large, and wish to estimate pj := P(X = j) for j = 1, . . . , k.
For a given sample xn1, the empirical proportion vector p̂n with coordinates $\hat{p}_{n,j} = \frac{1}{n} \sum_{i=1}^n 1\{X_i = j\}$
satisfies
\[
\mathrm{GS}_1(\hat{p}_n) = \frac{2}{n}
\]
because swapping a single example xi for x′i may change the empirical proportions in at most
two coordinates j, j′, each by 1/n. Consequently, the Laplace noise addition mechanism
\[
Z = \hat{p}_n + W, \qquad W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}\left(\frac{2}{n\varepsilon}\right)
\]
satisfies
\[
E\big[\|Z - \hat{p}_n\|_2^2\big] = \frac{8k}{n^2 \varepsilon^2}
\]
and consequently
\[
E\big[\|Z - p\|_2^2\big] = \frac{8k}{n^2\varepsilon^2} + \frac{1}{n} \sum_{j=1}^k p_j (1 - p_j) \le \frac{8k}{n^2\varepsilon^2} + \frac{1}{n}.
\]

This example shows one of the challenges of differentially private mechanisms: even in the case
where the quantity of interest is quite stable (insensitive to changes in the underlying sample,
or has small Lipschitz constant), it may be the case that the resulting mechanism adds noise
that introduces some dimension-dependent scaling. In this case, the condition on privacy
levels acceptable for good estimation—in that the rate of convergence is no different from the
non-private case, which achieves $E[\|\hat{p}_n - p\|_2^2] = \frac{1}{n} \sum_{j=1}^k p_j(1 - p_j) \le \frac{1}{n}$—is that $\varepsilon \gtrsim \sqrt{k/n}$. Thus,
in the case that the histogram has a large number of bins, the naive noise addition strategy
cannot provide as much protection without sacrificing efficiency.
If instead of ℓ2-error we consider ℓ∞ error, it is possible to provide somewhat more satisfying
results in this case. Indeed, we know that $P(\|W\|_\infty \ge t) \le k \exp(-t/b)$ for $W_j \stackrel{\mathrm{iid}}{\sim} \mathrm{Laplace}(b)$,
so that in the mechanism above we have
\[
P(\|Z - \hat{p}_n\|_\infty \ge t) \le k \exp\left(-\frac{t n \varepsilon}{2}\right) \quad \text{for all } t \ge 0,
\]


so using that each coordinate of p̂n is $\frac{1}{n}$-sub-Gaussian, we have
\[
E[\|Z - p\|_\infty] \le E[\|\hat{p}_n - p\|_\infty] + E[\|W\|_\infty]
\le \sqrt{\frac{2 \log k}{n}} + \inf_{t \ge 0} \left\{ t + \frac{2k}{n\varepsilon} \exp\left(-\frac{t n \varepsilon}{2}\right) \right\}
\le \sqrt{\frac{2 \log k}{n}} + \frac{2 \log k}{n\varepsilon} + \frac{2}{n\varepsilon}.
\]
In this case, then, whenever $\varepsilon \gg \sqrt{\log k / n}$, we obtain a rate of convergence of order
$\sqrt{2 \log k / n}$, which is a bit loose (as we have not carefully controlled the variance of p̂n), but some-
what more satisfying than the k-dependent penalty above. ⋄
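A sketch of the private histogram release in Example 7.1.5 (again our own illustrative code,
assuming numpy; the name is hypothetical):

    import numpy as np

    def private_histogram(xs, k, eps, rng):
        """Release the empirical proportions of xs in {0, ..., k-1} under
        eps-differential privacy; GS_1 of the proportion vector is 2/n."""
        n = len(xs)
        p_hat = np.bincount(xs, minlength=k) / n
        return p_hat + rng.laplace(scale=2 / (n * eps), size=k)

    rng = np.random.default_rng(2)
    k, n = 50, 100_000
    xs = rng.integers(k, size=n)  # uniform truth, p_j = 1/k
    z = private_histogram(xs, k, eps=1.0, rng=rng)
    print(np.max(np.abs(z - 1 / k)))  # ell-infinity error, on the order of sqrt(2 log k / n)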

7.1.2 Resilience to side information, Bayesian perspectives, and data processing


One of the major challenges in the definition of privacy is to protect against side information,
especially because in the future, information about you may be compromised, allowing various
linkage attacks. With this in mind, we return to our three desiderata. First, we note the following
simple fact: if Z is a differentially private view of a sample X1n (or associated empirical distribution
Pn), then any downstream functions Y are also differentially private. That is, if we have the Markov
chain Pn → Z → Y, then for any Pn, Pn′ ∈ Pn with dham(Pn, Pn′) ≤ 1, we have for any set A that
\[
\frac{P(Y \in A \mid P_n)}{P(Y \in A \mid P_n')}
= \frac{\int P(Y \in A \mid z) \, q(z \mid P_n) d\mu(z)}{\int P(Y \in A \mid z) \, q(z \mid P_n') d\mu(z)}
\le \frac{e^{\varepsilon} \int P(Y \in A \mid z) \, q(z \mid P_n') d\mu(z)}{\int P(Y \in A \mid z) \, q(z \mid P_n') d\mu(z)}
= e^{\varepsilon}.
\]
That is, no type of post-processing can reduce privacy.
With this simple idea out of the way, let us focus on our testing-based desideratum. In this
case, we consider a testing scenario, where an adversary wishes to test two hypotheses against one
another, where the hypotheses are
\[
H_0: X_1^n = x_1^n \quad \text{vs.} \quad H_1: X_1^n = (x_1^{i-1}, x_i', x_{i+1}^n),
\]
so that the samples under H0 and H1 differ only in the ith observation Xi ∈ {xi, x′i}. Now, for a
channel taking inputs from X n and outputting Z ∈ Z, we define ε-conditional hypothesis testing
privacy by saying that for all tests Ψ : Z → {0, 1},
\[
Q(\Psi(Z) = 1 \mid H_0, Z \in A) + Q(\Psi(Z) = 0 \mid H_1, Z \in A) \ge 1 - \varepsilon
\tag{7.1.4}
\]
for all sets A ⊂ Z satisfying Q(A | H0) > 0 and Q(A | H1) > 0. That is, roughly, no matter
what value Z takes on, the probability of error in a test of whether H0 or H1 is true—even with
knowledge of xj, j ≠ i—is high. We then have the following proposition.
Proposition 7.1.6. Assume the channel Q is ε-differentially private. Then Q is also ε̄-conditional
hypothesis testing private, where $\bar{\varepsilon} = 1 - e^{-2\varepsilon} \le 2\varepsilon$.
Proof Let Ψ be any test of H0 versus H1, and let B = {z | Ψ(z) = 1} be the acceptance region
of the test. Then
\begin{align*}
Q(B \mid H_0, Z \in A) + Q(B^c \mid H_1, Z \in A)
&= \frac{Q(A, B \mid H_0)}{Q(A \mid H_0)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon} \frac{Q(A, B \mid H_1)}{Q(A \mid H_1)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon} \frac{Q(A, B \mid H_1) + Q(A, B^c \mid H_1)}{Q(A \mid H_1)},
\end{align*}


where the first inequality uses ε-differential privacy twice (once to lower bound the numerator and
once to upper bound the denominator), and the second uses e−2ε ≤ 1. Then we simply note that
Q(A, B | H1) + Q(A, Bc | H1) = Q(A | H1).

So we see that (roughly), even conditional on the output of the channel, we still cannot test whether
the initial dataset was x or x′ whenever x and x′ differ in only a single observation.
An alternative perspective is to consider a Bayesian one, which allows us to more carefully
consider side information. In this case, we consider the following thought experiment. An adversary
has a set of prior beliefs π on X n , and we consider the adversary’s posterior π(· | Z) induced by
observing the output Z of some mechanism M . In this case, Bayes factors, which measure how
much prior and posterior distributions differ after observations, provide one immediate perspective.

Proposition 7.1.7. A mechanism M : Pn → Z is ε-differentially private if and only if for any
prior distribution π on Pn and any observation z ∈ Z, the posterior odds satisfy
\[
\frac{\pi(P_n \mid z)}{\pi(P_n' \mid z)} \le e^{\varepsilon} \frac{\pi(P_n)}{\pi(P_n')}
\]
for all Pn, Pn′ ∈ Pn with dham(Pn, Pn′) ≤ 1.

Proof Let q be the associated density of Z = M(·) (conditional or marginal). We have
π(Pn | z) = q(z | Pn)π(Pn)/q(z). Then
\[
\frac{\pi(P_n \mid z)}{\pi(P_n' \mid z)} = \frac{q(z \mid P_n) \, \pi(P_n)}{q(z \mid P_n') \, \pi(P_n')} \le e^{\varepsilon} \frac{\pi(P_n)}{\pi(P_n')}
\]
for all z, Pn, Pn′ if and only if M is ε-differentially private.

Thus we see that private channels mean that prior and posterior odds between two neighboring
samples cannot change substantially, no matter what the observation Z actually is.
For an alternative view, we consider a somewhat restricted family of prior distributions,
where we now take the view of a sample xn1 ∈ X n . There is some annoyance in this calculation
in that the order of the sample may be important, but it at least gets toward some semantic
interpretation of differential privacy. We consider the adversary’s beliefs on whether a particular
value x belongs to the sample, but more precisely, we consider whether Xi = x. We assume that
the prior density π on X n satisfies
\[
\pi(x_1^n) = \pi_{\setminus i}(x_{\setminus i}) \, \pi_i(x_i),
\tag{7.1.5}
\]
where $x_{\setminus i} = (x_1^{i-1}, x_{i+1}^n) \in \mathcal{X}^{n-1}$. That is, the adversary's beliefs about person i in the dataset

are independent of his beliefs about the other members of the dataset. (We assume that π is
a density with respect to a measure µ on X n−1 × X , where dµ(s, x) = dµ(s)dµ(x).) Under the
condition (7.1.5), we have the following proposition.

Proposition 7.1.8. Let Q be an ε-differentially private channel and let π be any prior distribution
satisfying condition (7.1.5). Then for any z, the posterior density πi on Xi satisfies
\[
e^{-\varepsilon} \pi_i(x) \le \pi_i(x \mid Z = z) \le e^{\varepsilon} \pi_i(x).
\]


Proof We abuse notation, and for a sample s ∈ X n−1, where $s = (x_1^{i-1}, x_{i+1}^n)$, we let
$s \oplus_i x = (x_1^{i-1}, x, x_{i+1}^n)$. Letting µ be the base measure on X n−1 × X with respect to which π is a density
and q(· | xn1) be the density of the channel Q, we have
\begin{align*}
\pi_i(x \mid Z = z)
&= \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x) \, d\mu(s)}
{\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x') \, \pi(s \oplus_i x') \, d\mu(s, x')} \\
&\stackrel{(\star)}{\le} e^{\varepsilon}
\frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x) \, d\mu(s)}
{\int_{s \in \mathcal{X}^{n-1}} \int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x) \, \pi(s \oplus_i x') \, d\mu(s) d\mu(x')} \\
&= e^{\varepsilon}
\frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi_{\setminus i}(s) \, d\mu(s) \; \pi_i(x)}
{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x) \, \pi_{\setminus i}(s) \, d\mu(s) \int_{x' \in \mathcal{X}} \pi_i(x') \, d\mu(x')}
= e^{\varepsilon} \pi_i(x),
\end{align*}
where inequality (⋆) follows from ε-differential privacy, as the samples s ⊕i x and s ⊕i x′ differ
only in their ith entry. The lower bound is similar.

In rough terms, then, Proposition 7.1.8 captures the idea that even if an adversary has
substantial prior knowledge—in the form of a prior distribution π on the ith value Xi and everything
else in the sample—the posterior cannot change much.

7.2 Weakenings of differential privacy


One challenge with the definition of differential privacy is that it can sometimes require the addition
of more noise to a desired statistic than is practical for real use. Moreover, the privacy considerations
interact in different ways with geometry: as we saw in Example 7.1.5, the Laplace mechanism
adds noise that introduces dimension-dependent scaling, which we discuss more in Example 7.2.9.
Consequently, it is of interest to develop weaker notions that—at least hopefully—still provide
appropriate and satisfactory privacy protections. To that end, we develop two additional types
of privacy that allow the development of more sophisticated and lower-noise mechanisms than
standard differential privacy; their protections are necessarily somewhat weaker but are typically
satisfactory.
We begin with a definition that allows (very rare) catastrophic privacy breaches—as long as the
probability of this event is extremely small (say, 10−20), these may be acceptable.
Definition 7.3. Let ε, δ ≥ 0. A mechanism M : Pn → Z is (ε, δ)-differentially private if for all
(measurable) sets S ⊂ Z and all neighboring samples Pn, Pn′,
\[
P(M(P_n) \in S) \le e^{\varepsilon} P(M(P_n') \in S) + \delta.
\tag{7.2.1}
\]
One typically thinks of δ in the definition above as satisfying δ = δn, where δn ≪ n−k for any
k ∈ N. (That is, δ decays super-polynomially to zero.) Some practitioners contend that all real-
world differentially private algorithms are in fact (ε, δ)-differentially private: while one may use
cryptographically secure random number generators, there is some possibility (call this δ) that a
cryptographic key may leak, or an encoding may be broken, in the future, making any mechanism
(ε, δ)-private at best for some δ > 0.
An alternative definition of privacy is based on Rényi divergences between distributions. These
are essentially monotone transformations of f-divergences (recall Chapter 2.2), though their
structure is somewhat more amenable to analysis, especially in our contexts. With that in mind,
we define


Definition 7.4. Let P and Q be distributions on a space X with densities p and q (with respect to
a measure µ). For α ∈ [1, ∞], the Rényi-α-divergence between P and Q is
\[
D_\alpha(P \| Q) := \frac{1}{\alpha - 1} \log \int \left( \frac{p(x)}{q(x)} \right)^{\alpha} q(x) \, d\mu(x).
\]
Here, the values α ∈ {1, ∞} are defined in terms of their respective limits.
Rényi divergences satisfy $\exp((\alpha - 1) D_\alpha(P \| Q)) = 1 + D_f(P \| Q)$, i.e., $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log(1 + D_f(P \| Q))$, for the f-divergence defined by $f(t) = t^{\alpha} - 1$, so that they inherit a number of the
properties of such divergences. We enumerate a few here for later reference.

Proposition 7.2.1 (Basic facts on Rényi divergence). Rényi divergences satisfy the following.

i. The divergence Dα (P ||Q) is non-decreasing in α.

ii. $\lim_{\alpha \downarrow 1} D_\alpha(P \| Q) = D_{\mathrm{kl}}(P \| Q)$ and $\lim_{\alpha \uparrow \infty} D_\alpha(P \| Q) = \log \sup\{ t \mid Q(p(X)/q(X) \ge t) > 0 \}$.

iii. Let K(· | x) be a Markov kernel from X → Z as in Proposition 2.2.13, and let KP and KQ be
the induced marginals of P and Q under K, respectively. Then Dα (KP ||KQ ) ≤ Dα (P ||Q).

We leave the proof of this proposition as Exercise 7.1, noting that property i is a consequence
of Hölder's inequality, property ii is by L'Hôpital's rule, and property iii is an immediate conse-
quence of Proposition 2.2.13. Rényi divergences also tensorize nicely—generalizing the tensoriza-
tion properties of KL-divergence and information of Chapter 2 (recall the chain rule (2.1.6) for
KL-divergence)—and we return to this later. As a preview, however, these tensorization proper-
ties allow us to prove that the composition of multiple private data releases remains appropriately
private.
With these preliminaries in place, we can state the following definition.

Definition 7.5 (Rényi-differential privacy). Let ε ≥ 0 and α ∈ [1, ∞]. A channel Q from Pn to
output space Z is (ε, α)-Rényi private if for all neighboring samples Pn, Pn′ ∈ Pn,
\[
D_\alpha\big( Q(\cdot \mid P_n) \,\|\, Q(\cdot \mid P_n') \big) \le \varepsilon.
\tag{7.2.2}
\]

Clearly, any ε-differentially private channel is also (ε, α)-Rényi private for any α ≥ 1; as we soon
see, we can provide tighter guarantees than this.

7.2.1 Basic mechanisms


We now describe a few of the basic mechanisms that provide guarantees of (ε, δ)-differential privacy
and (ε, α)-Rényi privacy. The advantage for these settings is that they allow mechanisms that more
naturally handle vectors in `2 , and smoothness with respect to Euclidean norms, than with respect
to `1 , which is most natural for pure ε-differential privacy. A starting point is the following example,
which we will leverage frequently.

Example 7.2.2 (Rényi divergence between Gaussian distributions): Consider normal distri-
butions N(µ0, Σ) and N(µ1, Σ). Then
\[
D_\alpha\big( \mathsf{N}(\mu_0, \Sigma) \,\|\, \mathsf{N}(\mu_1, \Sigma) \big) = \frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1).
\tag{7.2.3}
\]


To see this equality, we compute the appropriate integral of the densities. Let p and q be the
densities of N(µ0, Σ) and N(µ1, Σ), respectively. Then letting Eµ1 denote expectation over
X ∼ N(µ1, Σ), we have
\begin{align*}
\int \left( \frac{p(x)}{q(x)} \right)^{\alpha} q(x) dx
&= E_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2} (X - \mu_0)^T \Sigma^{-1} (X - \mu_0) + \frac{\alpha}{2} (X - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
&\stackrel{(i)}{=} E_{\mu_1}\left[ \exp\left( -\frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \alpha (\mu_0 - \mu_1)^T \Sigma^{-1} (X - \mu_1) \right) \right] \\
&\stackrel{(ii)}{=} \exp\left( -\frac{\alpha}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) + \frac{\alpha^2}{2} (\mu_0 - \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) \right),
\end{align*}
where equality (i) simply uses that (x − a)² − (x − b)² = (a − b)² + 2(b − a)(x − b), and
equality (ii) follows because (µ0 − µ1)T Σ−1(X − µ1) ∼ N(0, (µ1 − µ0)T Σ−1(µ1 − µ0)) under
X ∼ N(µ1, Σ), together with the Gaussian moment generating function. Noting that −α + α² = α(α − 1)
and taking logarithms gives the result. ⋄
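A quick numerical sanity check of the identity (7.2.3) in one dimension may be reassuring; the
snippet below is our own illustration (it assumes only numpy) and compares a direct Riemann-sum
evaluation of the divergence integral to the closed form.

    import numpy as np

    # Numerically verify (7.2.3) for d = 1: D_alpha(N(mu0, s2) || N(mu1, s2)).
    mu0, mu1, s2, alpha = 0.3, -0.2, 1.5, 4.0
    x = np.linspace(-30.0, 30.0, 1_000_001)
    dx = x[1] - x[0]
    p = np.exp(-((x - mu0) ** 2) / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    q = np.exp(-((x - mu1) ** 2) / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    numeric = np.log(np.sum((p / q) ** alpha * q) * dx) / (alpha - 1)
    closed_form = alpha * (mu0 - mu1) ** 2 / (2 * s2)
    print(numeric, closed_form)  # the two values should agree to several digits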

Example 7.2.2 is the key to developing different privacy-preserving schemes under Rényi privacy.
Let us reconsider Example 7.1.3, except that instead of assuming the function f of interest is smooth
with respect to `1 norm, we use the `2 -norm.
Example 7.2.3 (Gaussian mechanisms): Suppose that f : Pn → Rd has Lipschitz constant
L with respect to the ℓ2-norm (for the Hamming metric dham), that is, global ℓ2-sensitivity
\[
\mathrm{GS}_2(f) = \sup \left\{ \left\| f(P_n) - f(P_n') \right\|_2 \mid d_{\mathrm{ham}}(P_n, P_n') \le 1 \right\} \le L.
\]
Then, for any variance σ² > 0, we have that the mechanism
\[
Z = f(P_n) + W, \qquad W \sim \mathsf{N}(0, \sigma^2 I)
\]
satisfies
\[
D_\alpha\big( \mathsf{N}(f(P_n), \sigma^2 I) \,\|\, \mathsf{N}(f(P_n'), \sigma^2 I) \big)
= \frac{\alpha}{2\sigma^2} \left\| f(P_n) - f(P_n') \right\|_2^2 \le \frac{\alpha L^2}{2\sigma^2}
\]
for neighboring samples Pn, Pn′. Thus, if we have Lipschitz constant L and desire (ε, α)-Rényi
privacy, we may take $\sigma^2 = \frac{L^2 \alpha}{2\varepsilon}$, and then the mechanism
\[
Z = f(P_n) + W, \qquad W \sim \mathsf{N}\left(0, \frac{L^2 \alpha}{2\varepsilon} I \right)
\tag{7.2.4}
\]
satisfies (ε, α)-Rényi privacy. ⋄
Certain special cases make this more concrete. Indeed, suppose we wish to estimate a mean
E[X], where $X_i \stackrel{\mathrm{iid}}{\sim} P$ for some distribution P such that ∥Xi∥2 ≤ r with probability 1 for some
radius r < ∞.
Example 7.2.4 (Bounded mean estimation with Gaussian mechanisms): Letting $f(X_1^n) = \overline{X}_n$
be the sample mean, where the Xi satisfy ∥Xi∥2 ≤ r as above, we see immediately that
\[
\mathrm{GS}_2(f) = \frac{2r}{n}.
\]
In this case, the Gaussian mechanism (7.2.4) with L = 2r/n yields
\[
E\Big[ \big\| Z - \overline{X}_n \big\|_2^2 \Big] = E[\|W\|_2^2] = \frac{2 d r^2 \alpha}{n^2 \varepsilon}.
\]


Then we have
\[
E[\|Z - E[X]\|_2^2] = E[\|\overline{X}_n - E[X]\|_2^2] + E[\|Z - \overline{X}_n\|_2^2] \le \frac{r^2}{n} + \frac{2 d r^2 \alpha}{n^2 \varepsilon}.
\]
It is not immediately apparent how to compare this quantity to the case of the Laplace mech-
anism in Example 7.1.3, but we will return to this shortly, once we have developed connections
between the various privacy notions we have developed. ⋄
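As a sketch (function name and interface ours, assuming numpy), the Rényi-private mean release
of Example 7.2.4 looks as follows:

    import numpy as np

    def renyi_private_mean(xs, r, eps, alpha, rng):
        """(eps, alpha)-Renyi private release of the mean of the rows of xs, each with
        ||x_i||_2 <= r, via the Gaussian mechanism (7.2.4) with L = GS_2 = 2r/n."""
        n, d = xs.shape
        L = 2 * r / n
        sigma2 = L ** 2 * alpha / (2 * eps)  # variance calibrated as in Example 7.2.3
        return xs.mean(axis=0) + rng.normal(scale=np.sqrt(sigma2), size=d)

    rng = np.random.default_rng(3)
    xs = rng.normal(size=(10_000, 5))
    xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))  # enforce r = 1
    print(renyi_private_mean(xs, r=1.0, eps=0.25, alpha=4.0, rng=rng))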

7.2.2 Connections between privacy measures


An important consideration in our development of privacy definitions and mechanisms is to un-
derstand the relationships between the definitions: when does a channel Q satisfying one of the
definitions satisfy another? Thus, we collect a few consequences of our definitions, which help to
show how the various definitions are stronger or weaker than one another.
First, we argue that ε-differential privacy implies Rényi-differential privacy with a stronger
(smaller) privacy parameter.

Proposition 7.2.5. Let ε ≥ 0 and let P and Q be distributions such that e−ε ≤ P(A)/Q(A) ≤ eε
for all measurable sets A. Then for any α ∈ [1, ∞],
\[
D_\alpha(P \| Q) \le \min\left\{ \frac{3\alpha}{2} \varepsilon^2, \, \varepsilon \right\}.
\]

As an immediate corollary, we have

Corollary 7.2.6. Let ε ≥ 0 and assume that Q is ε-differentially private. Then for any α ≥ 1, Q
is $(\min\{\frac{3\alpha}{2}\varepsilon^2, \varepsilon\}, \alpha)$-Rényi private.

Before proving the proposition, let us see its implications for Example 7.2.4 versus estimation
under ε-differential privacy. Let ε ≤ 1, so that, roughly, to have “similar” privacy, we require
that our Rényi private channels satisfy $D_\alpha(Q(\cdot \mid x) \| Q(\cdot \mid x')) \le \varepsilon^2$. The ℓ1-sensitivity of the mean
satisfies $\| \overline{x}_n - \overline{x}_n' \|_1 \le \sqrt{d} \, \| \overline{x}_n - \overline{x}_n' \|_2 \le 2\sqrt{d} \, r/n$ for neighboring samples. Then the Laplace
mechanism (7.1.3) satisfies
\[
E[\| Z_{\mathrm{Laplace}} - E[X] \|_2^2] = E[\| \overline{X}_n - E[X] \|_2^2] + \frac{8r^2}{n^2\varepsilon^2} \cdot d^2,
\]
while the Gaussian mechanism under (ε², α)-Rényi privacy will yield
\[
E[\| Z_{\mathrm{Gauss}} - E[X] \|_2^2] = E[\| \overline{X}_n - E[X] \|_2^2] + \frac{2r^2}{n^2\varepsilon^2} \cdot d\alpha.
\]
This is evidently better than the Laplace mechanism whenever α < d.
Proof of Proposition 7.2.5 We assume that P and Q have densities p and q with respect to a
base measure µ, which is no loss of generality, whence the ratio condition implies that e−ε ≤ p/q ≤ eε
and $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \int (p/q)^{\alpha} q \, d\mu$. We prove the result assuming that α ∈ (1, ∞), as continuity
gives the result for α ∈ {1, ∞}.
First, it is clear that Dα(P ||Q) ≤ ε always. For the other term in the minimum, let us assume
that α ≤ 1 + 1/ε and ε ≤ 1. If either of these fails, the result is trivial, because for α > 1 + 1/ε we
have $\frac{3}{2}\alpha\varepsilon^2 \ge \frac{3}{2}\varepsilon \ge \varepsilon$, and similarly ε ≥ 1 implies $\frac{3}{2}\alpha\varepsilon^2 \ge \varepsilon$.

Now we perform a Taylor approximation of t ↦ (1 + t)α. By Taylor's theorem, we have for any
t > −1 that
\[
(1 + t)^{\alpha} = 1 + \alpha t + \frac{\alpha(\alpha - 1)}{2} (1 + \tilde{t})^{\alpha - 2} t^2
\]
for some t̃ ∈ [0, t] (or [t, 0] if t < 0). In particular, if 1 + t ≤ c, then $(1 + t)^{\alpha} \le 1 + \alpha t + \frac{\alpha(\alpha - 1)}{2} \max\{1, c^{\alpha - 2}\} t^2$. Now, we compute the divergence: we have
\begin{align*}
\exp\left( (\alpha - 1) D_\alpha(P \| Q) \right)
&= \int \left( \frac{p(z)}{q(z)} \right)^{\alpha} q(z) d\mu(z)
= \int \left( 1 + \frac{p(z)}{q(z)} - 1 \right)^{\alpha} q(z) d\mu(z) \\
&\le 1 + \alpha \int \left( \frac{p(z)}{q(z)} - 1 \right) q(z) d\mu(z)
+ \frac{\alpha(\alpha - 1)}{2} \max\{1, \exp(\varepsilon(\alpha - 2))\} \int \left( \frac{p(z)}{q(z)} - 1 \right)^2 q(z) d\mu(z) \\
&\le 1 + \frac{\alpha(\alpha - 1)}{2} e^{\varepsilon [\alpha - 2]_+} \cdot (e^{\varepsilon} - 1)^2,
\end{align*}
where the final inequality uses that $\int (p/q - 1) q \, d\mu = 0$ and that $|p/q - 1| \le e^{\varepsilon} - 1$.
Now, we know that α − 2 ≤ 1/ε − 1 by assumption, so using that log(1 + x) ≤ x, we obtain
\[
D_\alpha(P \| Q) \le \frac{\alpha}{2} (e^{\varepsilon} - 1)^2 \cdot \exp([1 - \varepsilon]_+).
\]
Finally, a numerical calculation yields that this quantity is at most $\frac{3\alpha}{2}\varepsilon^2$ for ε ≤ 1.

We can also provide connections from (ε, α)-Rényi privacy to (ε, δ)-differential privacy, and
then from there to ε-differential privacy. We begin by showing how to develop (ε, δ)-differential
privacy out of Rényi privacy. Another way to think about this proposition is that whenever two
distributions P and Q are close in Rényi divergence, then there is some limited “amplification” of
probabilities that is possible in moving from one to the other.

Proposition 7.2.7. Let P and Q satisfy Dα(P ||Q) ≤ ε. Then for any set A,
\[
P(A) \le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon \right) Q(A)^{\frac{\alpha - 1}{\alpha}}.
\]
Consequently, for any δ > 0,
\[
P(A) \le \min\left\{ \exp\left( \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A), \; \delta \right\}
\le \exp\left( \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A) + \delta.
\]

As above, we have an immediate corollary to this result.

Corollary 7.2.8. Assume that M is (ε, α)-Rényi private. Then it is also $(\varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta}, \delta)$-
differentially private for any δ > 0.

Before turning to the proof of the proposition, we show how it can provide prototypical (ε, δ)-
private mechanisms via Gaussian noise addition.


Example 7.2.9 (Gaussian mechanisms, continued): Consider Example 7.2.3, where f : Pn →
Rd has ℓ2-sensitivity L. Then by Example 7.2.2, the Gaussian mechanism Z = f(Pn) + W for
W ∼ N(0, σ²I) is $(\frac{\alpha L^2}{2\sigma^2}, \alpha)$-Rényi private for all α ≥ 1. Combining this with Corollary 7.2.8,
the Gaussian mechanism is also
\[
\left( \frac{\alpha L^2}{2\sigma^2} + \frac{1}{\alpha - 1} \log \frac{1}{\delta}, \; \delta \right)\text{-differentially private}
\]
for any δ > 0 and α > 1. Optimizing first over α by taking $\alpha = 1 + \sqrt{2\sigma^2 \log \delta^{-1} / L^2}$, we see
that the channel is $(\frac{L^2}{2\sigma^2} + \sqrt{2 L^2 \log \delta^{-1} / \sigma^2}, \delta)$-differentially private. Thus we have that the
Gaussian mechanism
\[
Z = f(P_n) + W, \qquad W \sim \mathsf{N}(0, \sigma^2 I) \quad \text{for} \quad
\sigma^2 = L^2 \max\left\{ \frac{8 \log \frac{1}{\delta}}{\varepsilon^2}, \frac{1}{\varepsilon} \right\}
\tag{7.2.5}
\]
is (ε, δ)-differentially private.
To continue with our ℓ2-bounded mean estimation in Example 7.2.4, let us assume that
ε < 8 log(1/δ), in which case the Gaussian mechanism (7.2.5) with L = GS2(f) = 2r/n achieves
(ε, δ)-differential privacy, and we have
\[
E[\| Z_{\mathrm{Gauss}} - E[X] \|_2^2] = E[\| \overline{X}_n - E[X] \|_2^2] + O(1) \frac{r^2}{n^2 \varepsilon^2} \cdot d \log \frac{1}{\delta}.
\]
Comparing to the previous cases, we see an improvement over the Laplace mechanism whenever
log(1/δ) ≪ d, that is, whenever δ ≫ e−d. ⋄
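The calibration (7.2.5) and the α-optimization behind it are easy to code; below is a minimal
sketch of our own (function names hypothetical, assuming numpy):

    import numpy as np

    def gaussian_sigma2(L, eps, delta):
        """Noise variance from (7.2.5): the Gaussian mechanism with this variance
        is (eps, delta)-differentially private for an L-sensitive (ell_2) function."""
        return L ** 2 * max(8 * np.log(1 / delta) / eps ** 2, 1 / eps)

    def dp_eps_from_renyi(L, sigma2, delta):
        """Convert the (alpha L^2 / (2 sigma^2), alpha)-Renyi guarantee to (eps, delta)-DP
        via Corollary 7.2.8, at the optimal alpha = 1 + sqrt(2 sigma^2 log(1/delta) / L^2)."""
        return L ** 2 / (2 * sigma2) + np.sqrt(2 * L ** 2 * np.log(1 / delta) / sigma2)

    L, eps, delta = 0.01, 0.5, 1e-9
    s2 = gaussian_sigma2(L, eps, delta)
    print(dp_eps_from_renyi(L, s2, delta) <= eps)  # True: the calibration is valid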

Proof of Proposition 7.2.7 We use the data processing inequality (part iii of Proposition 7.2.1),
applied to the quantization z ↦ 1{z ∈ A}, which shows that
\[
\varepsilon \ge D_\alpha(P \| Q) \ge \frac{1}{\alpha - 1} \log \left( Q(A) \left( \frac{P(A)}{Q(A)} \right)^{\alpha} \right).
\]
Rearranging and taking exponentials, we immediately obtain the first claim of the proposition.
For the second, we require a bit more work. First, let us assume that $Q(A) > e^{-\varepsilon} \delta^{\frac{\alpha}{\alpha - 1}}$. Then
we have by the first claim of the proposition that
\begin{align*}
P(A) &\le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \log \frac{1}{Q(A)} \right) Q(A) \\
&\le \exp\left( \frac{\alpha - 1}{\alpha} \varepsilon + \frac{1}{\alpha} \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A)
= \exp\left( \varepsilon + \frac{1}{\alpha - 1} \log \frac{1}{\delta} \right) Q(A).
\end{align*}
On the other hand, when $Q(A) \le e^{-\varepsilon} \delta^{\frac{\alpha}{\alpha - 1}}$, then again using the first result of the proposition,
\begin{align*}
P(A) &\le \exp\left( \frac{\alpha - 1}{\alpha} \left( \varepsilon + \log Q(A) \right) \right) \\
&\le \exp\left( \frac{\alpha - 1}{\alpha} \left( \varepsilon - \varepsilon + \frac{\alpha}{\alpha - 1} \log \delta \right) \right) = \delta.
\end{align*}

This gives the second claim of the proposition.


Finally, we develop our last set of connections, which show how we may relate (ε, δ)-private
channels with ε-private channels. For this, we require one additional weakened
notion of divergence, which relates (ε, δ)-differential privacy to Rényi-α-divergence with α = ∞.
We define
\[
D_\infty^{\delta}(P \| Q) := \sup_{S \subset \mathcal{X}} \left\{ \log \frac{P(S) - \delta}{Q(S)} \;\Big|\; P(S) > \delta \right\},
\]
where the supremum is over measurable sets. Evidently equivalent to this definition is that
$D_\infty^{\delta}(P \| Q) \le \varepsilon$ if and only if
\[
P(S) \le e^{\varepsilon} Q(S) + \delta \quad \text{for all } S \subset \mathcal{X}.
\]
Then we have the following lemma.

Lemma 7.2.10. Let ε > 0 and δ ∈ (0, 1), and let P and Q be distributions on a space X.

(i) We have $D_\infty^{\delta}(P \| Q) \le \varepsilon$ if and only if there exists a probability distribution R on X such that
$\|P - R\|_{\mathrm{TV}} \le \delta$ and $D_\infty(R \| Q) \le \varepsilon$.

(ii) We have $D_\infty^{\delta}(P \| Q) \le \varepsilon$ and $D_\infty^{\delta}(Q \| P) \le \varepsilon$ if and