
Statistics and Information Theory

John Duchi

October 22, 2024


Contents

1 Introduction and setting 10


1.1 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Moving to statistics and machine learning . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Outline and chapter discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 A remark about measure theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 An information theory review 15


2.1 Basics of Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Chain rules and related properties . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3 Data processing inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 General divergence measures and definitions . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Partitions, algebras, and quantizers . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 KL-divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 f -divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.4 Inequalities and relationships between divergences . . . . . . . . . . . . . . . 28
2.2.5 Convexity and data processing for divergence measures . . . . . . . . . . . . 32
2.3 First steps into optimal procedures: testing inequalities . . . . . . . . . . . . . . . . 33
2.3.1 Le Cam’s inequality and binary hypothesis testing . . . . . . . . . . . . . . . 33
2.3.2 Fano’s inequality and multiple hypothesis testing . . . . . . . . . . . . . . . . 35
2.4 A first operational result: entropy and source coding . . . . . . . . . . . . . . . . . . 37
2.4.1 The source coding problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.2 The Kraft-McMillan inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.3 Entropy rates and longer codes . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3 Exponential families and statistical modeling 48


3.1 Exponential family models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 Why exponential families? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Fitting an exponential family model . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Divergence measures and information for exponential families . . . . . . . . . . . . . 54
3.4 Generalized linear models and regression . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.1 Fitting a generalized linear model from a sample . . . . . . . . . . . . . . . . 58
3.4.2 The information in a generalized linear model . . . . . . . . . . . . . . . . . . 59
3.5 Lower bounds on testing a parameter’s value . . . . . . . . . . . . . . . . . . . . . . 61


3.6 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63


3.6.1 Proof of Proposition 3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

I Concentration, information, stability, and generalization 66

4 Concentration Inequalities 67
4.1 Basic tail inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1.1 Sub-Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1.2 Sub-exponential random variables . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.3 Orlicz norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.4 First applications of concentration: random projections . . . . . . . . . . . . 80
4.1.5 A second application of concentration: codebook generation . . . . . . . . . . 81
4.2 Martingale methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities . . . . . . . . . 84
4.2.2 Examples and bounded differences . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Matrix concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 Proof of Theorem 4.1.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.2 Proof of Theorem 4.1.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Proof of Theorem 5.1.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.4 Proof of Proposition 4.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5 Estimation and generalization 102


5.1 Uniformity and metric entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.1 Symmetrization and uniform laws . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.2 Metric entropy, coverings, and packings . . . . . . . . . . . . . . . . . . . . . 106
5.1.3 Application: matrix concentration . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Generalization bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.1 Finite and countable classes of functions . . . . . . . . . . . . . . . . . . . . . 112
5.2.2 Large classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.2.3 Structural risk minimization and adaptivity . . . . . . . . . . . . . . . . . . . 116
5.3 M-estimators and estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3.1 Standard conditions and convex optimization . . . . . . . . . . . . . . . . . . 119
5.3.2 Some growth properties of convex functions . . . . . . . . . . . . . . . . . . . 120
5.3.3 Convergence analysis for convex M-estimators . . . . . . . . . . . . . . . . . . 122
5.3.4 Consequences for exponential families and generalized linear models . . . . . 124
5.3.5 Proof of Theorem 5.3.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


6 Generalization and stability 132


6.1 The variational representation of Kullback-Leibler divergence . . . . . . . . . . . . . 133
6.2 PAC-Bayes bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.2.1 Relative bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.2 A large-margin guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2.3 A mutual information bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.3 Interactive data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3.1 The interactive setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.2 Second moment errors and mutual information . . . . . . . . . . . . . . . . . 144
6.3.3 Limiting interaction in interactive analyses . . . . . . . . . . . . . . . . . . . 145
6.3.4 Error bounds for a simple noise addition scheme . . . . . . . . . . . . . . . . 150
6.4 Bibliography and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

7 Advanced concentration inequalities 157


7.1 From divergences to concentration and back . . . . . . . . . . . . . . . . . . . . . . . 157
7.1.1 Concentration of covariance matrices via the variational representation . . . . 159
7.1.2 A generalized connection between moment generating functions and divergence . 161
7.2 Transportation inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.2.1 A tensorized transportation inequality . . . . . . . . . . . . . . . . . . . . . . 165
7.2.2 A heuristic proof of Theorem 7.2.1 . . . . . . . . . . . . . . . . . . . . . . . . 166
7.2.3 Proof of Corollary 7.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.3 Some applications of concentration and the variational inequality . . . . . . . . . . . 168
7.3.1 Metric Gaussianity, transport inequalities, and expansion of sets . . . . . . . 168
7.3.2 A weak and strong converse for hypothesis testing . . . . . . . . . . . . . . . 171
7.4 Discussion and bibliographic remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

8 Privacy and disclosure limitation 177


8.1 Disclosure limitation, privacy, and definitions . . . . . . . . . . . . . . . . . . . . . . 177
8.1.1 Basic mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
8.1.2 Resilience to side information, Bayesian perspectives, and data processing . . 183
8.2 Weakenings of differential privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
8.2.1 Basic mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.2.2 Connections between privacy measures . . . . . . . . . . . . . . . . . . . . . . 188
8.2.3 Side information protections under weakened notions of privacy . . . . . . . . 191
8.3 Composition and privacy based on divergence . . . . . . . . . . . . . . . . . . . . . . 194
8.3.1 Composition of Rényi-private channels . . . . . . . . . . . . . . . . . . . . . . 194
8.3.2 Privacy games and composition . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.4 Additional mechanisms and privacy-preserving algorithms . . . . . . . . . . . . . . . 197
8.4.1 The exponential mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.2 Local sensitivities and the inverse sensitivity mechanism . . . . . . . . . . . . 200
8.5 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.5.1 Proof of Lemma 8.2.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208


II Fundamental limits and optimality 215

9 Minimax lower bounds: the Le Cam, Fano, and Assouad methods 217
9.1 Basic framework and minimax risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.2 Preliminaries on methods for lower bounds . . . . . . . . . . . . . . . . . . . . . . . 219
9.2.1 From estimation to testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.2.2 Inequalities between divergences and product distributions . . . . . . . . . . 221
9.2.3 Metric entropy and packing numbers . . . . . . . . . . . . . . . . . . . . . . . 223
9.3 Le Cam’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.4 Fano’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.4.1 The classical (local) Fano method . . . . . . . . . . . . . . . . . . . . . . . . 226
9.4.2 A distance-based Fano method . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.5 Assouad’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5.1 Well-separated problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5.2 From estimation to multiple binary tests . . . . . . . . . . . . . . . . . . . . . 235
9.5.3 Example applications of Assouad’s method . . . . . . . . . . . . . . . . . . . 237
9.6 Deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.6.1 Proof of Proposition 9.4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.6.2 Proof of Corollary 9.4.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.6.3 Proof of Lemma 9.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
9.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

10 Beyond local minimax techniques 250


10.1 Nonparametric regression: minimax upper and lower bounds . . . . . . . . . . . . . 250
10.1.1 Kernel estimates of the function . . . . . . . . . . . . . . . . . . . . . . . . . 251
10.1.2 Minimax lower bounds on estimation with Assouad’s method . . . . . . . . . 253
10.2 Global Fano Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.2.1 A mutual information bound based on metric entropy . . . . . . . . . . . . . 256
10.2.2 Minimax bounds using global packings . . . . . . . . . . . . . . . . . . . . . . 258
10.2.3 Example: non-parametric regression . . . . . . . . . . . . . . . . . . . . . . . 259
10.3 Strong converses and high-probability lower bounds . . . . . . . . . . . . . . . . . . . 260
10.3.1 Refined Fano inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.3.2 High probability estimation lower bounds . . . . . . . . . . . . . . . . . . . . 265
10.3.3 Proof of Theorem 10.3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

11 Constrained risk inequalities 269


11.1 Strong data processing inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
11.2 Local privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.3 Communication complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
11.3.1 Classical communication complexity problems . . . . . . . . . . . . . . . . . . 276
11.3.2 Deterministic communication: lower bounds and structure . . . . . . . . . . . 279
11.3.3 Randomization, information complexity, and direct sums . . . . . . . . . . . 281
11.3.4 The structure of randomized communication and communication complexity
of primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
11.4 Communication complexity in estimation . . . . . . . . . . . . . . . . . . . . . . . . 288


11.4.1 Direct sum communication bounds . . . . . . . . . . . . . . . . . . . . . . . . 289


11.4.2 Communication data processing . . . . . . . . . . . . . . . . . . . . . . . . . 290
11.4.3 Applications: communication and privacy lower bounds . . . . . . . . . . . . 292
11.5 Proof of Theorem 11.4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
11.5.1 Proof of Lemma 11.5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
11.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
11.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302

12 Squared error and asymptotically exact optimality guarantees 307


12.1 The Cramér-Rao inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
12.1.1 Compact sets and the failure of the Cramér-Rao bound . . . . . . . . . . . . 309
12.1.2 Regularization and the failure of the Cramér-Rao bound . . . . . . . . . . . . 309
12.2 The van Trees inequality: a Bayesian Cramér-Rao bound . . . . . . . . . . . . . . . 310
12.2.1 The van Trees inequality in one dimension . . . . . . . . . . . . . . . . . . . . 311
12.2.2 The van Trees inequality in d-dimensions . . . . . . . . . . . . . . . . . . . . 312
12.2.3 The van Trees inequality for a function of the parameter . . . . . . . . . . . . 314
12.3 Beyond parametric problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
12.3.1 An extended example: M-estimation lower bounds . . . . . . . . . . . . . . . 321
12.4 Super-efficiency and instance optimality . . . . . . . . . . . . . . . . . . . . . . . . . 323
12.5 Applications in privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
12.6 Bibliography and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

13 Testing and functional estimation 328


13.1 Geometrizing rates of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
13.1.1 Fisher information and divergence measures . . . . . . . . . . . . . . . . . . . 332
13.1.2 Valid asymptotic information expansions of divergences . . . . . . . . . . . . 334
13.2 Le Cam’s convex hull method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
13.2.1 The χ2 -mixture bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
13.2.2 Estimating the norm of a Gaussian vector . . . . . . . . . . . . . . . . . . . . 340
13.2.3 Lower bounds on estimating integral functionals . . . . . . . . . . . . . . . . 342
13.3 Minimax hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
13.3.1 Detecting a difference in populations . . . . . . . . . . . . . . . . . . . . . . . 346
13.3.2 Signal detection and testing a Gaussian mean . . . . . . . . . . . . . . . . . . 348
13.3.3 Goodness of fit and two-sample tests for multinomials . . . . . . . . . . . . . 350
13.3.4 Detecting sparse signals and phase transitions . . . . . . . . . . . . . . . . . . 353
13.4 Instance-optimal lower bounds and super-efficiency . . . . . . . . . . . . . . . . . . . 358
13.4.1 Risk transfer inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
13.4.2 A general risk transfer bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
13.4.3 Risk transfer with mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
13.5 Deferred and technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13.5.1 Proof of Lemma 13.1.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13.5.2 Proof of Lemma 13.1.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
13.7 A useful divergence calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
13.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370


III Entropy, predictions, divergences, and information 377

14 Predictions, loss functions, and entropies 379


14.1 Proper losses, scoring rules, and generalized entropies . . . . . . . . . . . . . . . . . 380
14.1.1 A convexity primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
14.1.2 From a proper loss to an entropy . . . . . . . . . . . . . . . . . . . . . . . . . 383
14.1.3 The information in an experiment . . . . . . . . . . . . . . . . . . . . . . . . 385
14.2 Characterizing proper losses and Bregman divergences . . . . . . . . . . . . . . . . . 386
14.2.1 Characterizing proper losses for Y taking finitely many values . . . . . . . . . 386
14.2.2 General proper losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
14.2.3 Proper losses and vector-valued Y . . . . . . . . . . . . . . . . . . . . . . . . 393
14.3 From entropies to convex losses, arbitrary predictions, and link functions . . . . . . . 396
14.3.1 Convex conjugate linkages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.3.2 Convex conjugate linkages with affine constraints . . . . . . . . . . . . . . . . 400
14.4 Exponential families, maximum entropy, and log loss . . . . . . . . . . . . . . . . . . 403
14.4.1 Maximizing entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.5 Technical and deferred proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.5.1 Finalizing the proof of Theorem 14.2.15 . . . . . . . . . . . . . . . . . . . . . 409
14.5.2 Proof of Proposition 14.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
14.5.3 Proof of Proposition 14.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
14.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

15 Calibration and Proper Losses 416


15.1 Proper losses and calibration error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
15.2 Measuring calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
15.2.1 The impossibility of measuring calibration . . . . . . . . . . . . . . . . . . . . 420
15.2.2 Alternative calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . 423
15.3 Auditing and improving calibration at the population level . . . . . . . . . . . . . . 426
15.3.1 The post-processing gap and calibration audits for squared error . . . . . . . 426
15.3.2 Calibration audits for losses based on conjugate linkages . . . . . . . . . . . . 428
15.3.3 A population-level algorithm for calibration . . . . . . . . . . . . . . . . . . . 430
15.4 Calibeating: improving squared error by calibration . . . . . . . . . . . . . . . . . . 431
15.4.1 Proof of Theorem 15.4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
15.5 Continuous and equivalent calibration measures . . . . . . . . . . . . . . . . . . . . . 437
15.5.1 Calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
15.5.2 Equivalent calibration measures . . . . . . . . . . . . . . . . . . . . . . . . . . 440
15.6 Deferred technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
15.6.1 Proof of Lemma 15.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
15.6.2 Proof of Proposition 15.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
15.6.3 Proof of Lemma 15.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
15.6.4 Proof of Theorem 15.5.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
15.7 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
15.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452


16 Classification, Divergences, and Surrogate Risk 454


16.1 Surrogate risk consistency in binary classification . . . . . . . . . . . . . . . . . . . . 455
16.1.1 A general classification calibration result . . . . . . . . . . . . . . . . . . . . . 458
16.1.2 Convex losses for binary classification . . . . . . . . . . . . . . . . . . . . . . 459
16.1.3 Proof of Theorem 16.1.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
16.2 General surrogate risk consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
16.2.1 Uniform calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
16.2.2 Pointwise calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
16.2.3 Examples: multiclass surrogate risk consistency . . . . . . . . . . . . . . . . . 466
16.3 Generalized entropies and surrogate risk consistency . . . . . . . . . . . . . . . . . . 468
16.3.1 Proof of Theorem 16.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
16.4 Structured prediction and generalized entropies . . . . . . . . . . . . . . . . . . . . . 471
16.4.1 The failure of naive margin- and hinge-type losses . . . . . . . . . . . . . . . 474
16.4.2 Structured prediction losses via the generalized entropy . . . . . . . . . . . . 476
16.4.3 Proof of Theorem 16.4.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
16.5 Universal loss equivalence and entropies . . . . . . . . . . . . . . . . . . . . . . . . . 480
16.5.1 Proof of Theorem 16.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
16.5.2 Proof of Lemma 16.5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
16.5.3 Proof of Lemma 16.5.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
16.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
16.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486

IV Online game playing and compression 490

17 Stochastic and online convex optimization 491


17.1 Preliminaries on convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
17.2 Online convex optimization methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
17.2.1 Projected subgradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . 494
17.2.2 Mirror descent-type methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
17.2.3 Convergence analysis of mirror descent . . . . . . . . . . . . . . . . . . . . . . 498
17.2.4 Instantiations of the regret guarantee . . . . . . . . . . . . . . . . . . . . . . 499
17.2.5 Proof of Theorem 17.2.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
17.3 Optimality guarantees and fundamental limits . . . . . . . . . . . . . . . . . . . . . . 503
17.3.1 From optimization to testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
17.3.2 Constructing hard classes of optimization problems . . . . . . . . . . . . . . . 506
17.3.3 Instantiations and optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
17.3.4 A lower bound for high-dimensional stochastic optimization . . . . . . . . . . 512
17.4 Online to batch conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
17.5 More refined convergence guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
17.5.1 Proof of Proposition 17.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
17.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517

18 Exploration, exploitation, and bandit problems 523


18.1 The multi-armed bandit problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
18.2 Confidence-based algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
18.3 General losses and information-based bounds . . . . . . . . . . . . . . . . . . . . . . 529


18.3.1 An information-based regret bound . . . . . . . . . . . . . . . . . . . . . . . . 530


18.3.2 Posterior (Thompson) sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 533
18.3.3 Information-based exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
18.3.4 An extended example: linear bandits . . . . . . . . . . . . . . . . . . . . . . . 539
18.4 Online gradient descent approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
18.4.1 Some empirical comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
18.5 Minimax lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
18.5.1 Action separation and a modulus of continuity . . . . . . . . . . . . . . . . . 546
18.5.2 Assouad's method for lower bounds . . . . . . . . . . . . . . . . . . . . . . . 549
18.5.3 Proof of Theorem 18.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
18.6 Technical proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
18.6.1 Proof of Lemma 18.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
18.7 Further notes and references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
18.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556

19 Minimax games and Bayesian estimation 558


19.1 Robust Bayesian procedures and maximum entropy . . . . . . . . . . . . . . . . . . . 559
19.1.1 A digression on min-max games . . . . . . . . . . . . . . . . . . . . . . . . . . 560
19.1.2 Saddle points for maximum entropy . . . . . . . . . . . . . . . . . . . . . . . 561
19.1.3 Exponential family models as robust Bayesian procedures . . . . . . . . . . . 561
19.2 The coding game and sequential prediction . . . . . . . . . . . . . . . . . . . . . . . 563
19.3 Expected regret, information capacity, and redundancy . . . . . . . . . . . . . . . . . 565
19.3.1 Information capacity and regret duality . . . . . . . . . . . . . . . . . . . . . 566
19.3.2 Instantiations and corollaries of regret/capacity duality . . . . . . . . . . . . 569
19.3.3 Maximum generalized entropy and Robust Bayesian procedures . . . . . . . . 570
19.3.4 Proof of Lemma 19.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
19.4 Minimax strategies for regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
19.5 Mixture (Bayesian) strategies and redundancy . . . . . . . . . . . . . . . . . . . . . . 575
19.5.1 Bayesian redundancy and objective, reference, and Jeffreys priors . . . . . . . 578
19.5.2 Heuristic calculations: normality and Theorem 19.5.1 . . . . . . . . . . . . . 580
19.6 Regret and capacity dualities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
19.6.1 Duality when the domain is finite . . . . . . . . . . . . . . . . . . . . . . . . . 581
19.6.2 Proof of Corollary 19.3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
19.6.3 Regret/capacity duality for arbitrary domains . . . . . . . . . . . . . . . . . . 584
19.6.4 A formal statement of regret/capacity duality . . . . . . . . . . . . . . . . . . 588
19.7 Bibliographic details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
19.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589

V Appendices 592

A Miscellaneous mathematical results 593


A.1 The roots of a polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
A.2 Measure-theoretic development of divergence measures . . . . . . . . . . . . . . . . . 593
A.3 Integral convergence and completeness of probability spaces . . . . . . . . . . . . . . 593
A.4 Probabilistic convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
A.4.1 Classical results on convergence in distribution . . . . . . . . . . . . . . . . . 594


A.4.2 Assorted convergence results for probability distributions . . . . . . . . . . . 595


A.5 Stirling approximations and entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 598

B Convex Analysis 600


B.1 Convex sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
B.1.1 Operations preserving convexity . . . . . . . . . . . . . . . . . . . . . . . . . 602
B.1.2 Representation and separation of convex sets . . . . . . . . . . . . . . . . . . 604
B.2 Sublinear and support functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
B.3 Convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
B.3.1 Equivalent definitions of convex functions . . . . . . . . . . . . . . . . . . . . 612
B.3.2 Continuity properties of convex functions . . . . . . . . . . . . . . . . . . . . 614
B.3.3 Operations preserving convexity . . . . . . . . . . . . . . . . . . . . . . . . . 620
B.3.4 Smoothness properties, first-order developments for convex functions, and
subdifferentiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
B.3.5 Calculus rules of subgradients . . . . . . . . . . . . . . . . . . . . . . . . . . . 628

C Optimality, stability, and duality 631


C.1 Optimality conditions and stability properties . . . . . . . . . . . . . . . . . . . . . . 632
C.1.1 Subgradient characterizations for optimality . . . . . . . . . . . . . . . . . . . 632
C.1.2 Stability properties of minimizers . . . . . . . . . . . . . . . . . . . . . . . . . 634
C.2 Conjugacy and duality properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
C.2.1 Gradient dualities and the Fenchel-Young inequality . . . . . . . . . . . . . . 640
C.2.2 Smoothness and strict convexity of conjugates . . . . . . . . . . . . . . . . . . 641
C.2.3 Smooth convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
C.3 Limits at infinity of convex functions and sets . . . . . . . . . . . . . . . . . . . . . . 645
C.3.1 Boundedness and closedness of convex sets . . . . . . . . . . . . . . . . . . . 646
C.3.2 Asymptotic growth and existence of minimizers . . . . . . . . . . . . . . . . . 649
C.4 Saddle point theorems and min-max duality . . . . . . . . . . . . . . . . . . . . . . . 651
C.4.1 Saddle points and convex conjugates . . . . . . . . . . . . . . . . . . . . . . . 652
C.4.2 Min-max duality and the existence of saddle points . . . . . . . . . . . . . . . 654
C.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655

Chapter 1

Introduction and setting

This book explores some of the (many) connections relating information theory, statistics, computa-
tion, and learning. Signal processing, machine learning, and statistics all revolve around extracting
useful information from signals and data. In signal processing and information theory, a central
question is how to best design signals—and the channels over which they are transmitted—to max-
imally communicate and store information, and to allow the most effective decoding. In machine
learning and statistics, by contrast, it is often the case that nature provides a fixed data distri-
bution, and it is the learner’s or statistician’s goal to recover information about this (unknown)
distribution. Our goal will be to show how information-theoretic perspectives can provide clean
answers about, and techniques to perform, this recovery.
The discovery of fundamental limits forms a central aspect of information theory: the develop-
ment of results that demonstrate that certain procedures are optimal. Thus, information theoretic
tools allow a characterization of the attainable results in a variety of communication and statis-
tical settings. As we explore in the coming chapters in the context of statistical, inferential, and
machine learning tasks, this allows us to develop procedures whose optimality we can certify—no
better procedure is possible. Such results are useful for a myriad of reasons: we would like to avoid
making bad decisions or false inferences, we may realize a task is impossible, and we can explicitly
calculate the amount of data necessary for solving different statistical problems.

1.1 Information theory


Information theory focuses on a plethora of deep questions: What is information? How much
information content do various signals and data hold? How much information can be reliably
transmitted over a noisy communication channel? We will leave delineation of the discipline and
answers to these questions to information theorists, instead grossly oversimplifying information
theory into two main inquiries, with corresponding chains of tasks.

1. How much information does a signal contain?

2. How much information can a noisy channel reliably transmit?

In this context, we provide two main high-level examples, one for each of these tasks.

Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.


Graphically, we have

Source → Compressor → Decompressor → Receiver

The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
uses the fewest number of bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This fewest number of bits is then the information content of the source (signal). 3

Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. The graphical representation becomes

Source → Compressor → Channel → Decompressor → Receiver

Here we investigate the maximum number of bits that may be sent per each channel use in
the sense that the receiver can reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy—in the form of additional bits—that must be sent to
allow such reconstruction. 3

1.2 Moving to statistics and machine learning


We advocate a study of statistics and machine learning that—broadly—keeps in mind the same
views. Let us attempt, then, to shoehorn statistics and machine learning into such source coding
and channel coding problems, which will help to illuminate the perspective that information-theoretic
techniques give.
In the analogy with source coding, we observe a sequence of data points X1 , . . . , Xn drawn from
some (unknown) distribution P on a space X . For example, we might be observing species that
biologists collect. Then by analogy, we construct a model (often a generative model) that encodes
the data using relatively few bits: that is,

$$\text{Source } (P) \xrightarrow{X_1, \ldots, X_n} \text{Compressor} \xrightarrow{\widehat{P}} \text{Decompressor} \rightarrow \text{Receiver}.$$

Here, we estimate $\widehat{P}$—an empirical version of the distribution P that is easier to describe than
the original signal X1 , . . . , Xn —with the hope that we learn information about the generating
distribution P , or at least describe it efficiently.
In our analogy with channel coding we can connect to estimation and inference. Consider a
statistical problem in which there exists some unknown function f on a space X that we wish
to estimate, and we are able to observe a noisy version of f (Xi ) for a series of Xi drawn from
a distribution P . Recalling the graphical description of Example 1.1.2, we now have a channel
P (Y | f (X)) that gives us noisy observations of f (X) for each Xi , but we generally no longer
choose the encoder/compressor. That is, we have
$$\text{Source } (P) \xrightarrow{X_1, \ldots, X_n} \text{Compressor} \xrightarrow{f(X_1), \ldots, f(X_n)} \text{Channel } P(Y \mid f(X)) \xrightarrow{Y_1, \ldots, Y_n} \text{Decompressor}.$$

The estimation—decompression—problem is to either estimate f , or, in some cases, to estimate
other aspects of the source probability distribution P . In statistical problems, we do not have


any choice in the design of the compressor f that transforms the original signal X1 , . . . , Xn , which
makes it somewhat different from traditional ideas in information theory. In some cases that we
explore later—such as experimental design, randomized controlled trials, reinforcement learning
and bandits (and associated exploration/exploitation tradeoffs)—we are also able to influence the
compression part of the above scheme.

Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data Xi belong to Rd , and the compression function f (x) = θ⊤ x
for some vector θ ∈ Rd . Then the channel is often of the form

$$Y_i = \underbrace{\theta^\top X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}},$$

where $\varepsilon_i \stackrel{\text{iid}}{\sim} \mathsf{N}(0, \sigma^2)$ are independent mean-zero normal perturbations. Given a sequence of
pairs (Xi , Yi ), we wish to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental
design, we may choose the sequence Xi so as to better explore properties of θ. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting ∥Xi ∥ → ∞—
choosing very “large” vectors xi —then the signal of θ⊤ Xi should swamp any noise and make
estimation easier. 3
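
To make Example 1.2.1 concrete, here is a minimal simulation sketch in Python (assuming NumPy is available; the sample size, dimension, and noise level are illustrative choices of ours, and the decoder is plain least squares rather than anything prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 5, 1.0            # illustrative sample size, dimension, noise level
theta = rng.normal(size=d)            # unknown parameter we wish to recover

X = rng.normal(size=(n, d))           # "source": X_i drawn i.i.d. from N(0, I_d)
Y = X @ theta + sigma * rng.normal(size=n)   # "channel": Y_i = theta^T X_i + eps_i

# "Decoder": ordinary least squares recovers theta from the noisy observations.
theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print("error:", np.linalg.norm(theta_hat - theta))

# Allowing "infinite power" (scaling up ||X_i||) makes the signal swamp the noise.
X_big = 100.0 * X
Y_big = X_big @ theta + sigma * rng.normal(size=n)
theta_big = np.linalg.lstsq(X_big, Y_big, rcond=None)[0]
print("error with large inputs:", np.linalg.norm(theta_big - theta))
```

Running this, the second error is roughly a hundredth of the first, reflecting the signal-to-noise intuition above.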

The remainder of the book explores these ideas.

1.3 Outline and chapter discussion


I divide the book into four distinct parts, each of course interacting with the others, but it is possible
to read each as a reasonably self-contained unit. The book begins with a review (Chapter 2)
that introduces the basic information-theoretic quantities that we discuss: mutual information,
entropy, and divergence measures. It is required reading for all the chapters that follow. Chapter 3
provides an overview of exponential family models, which form a core tool in the statistical learning
toolbox. Readers familiar with this material, perhaps via a course on generalized linear models,
can certainly skip this, but it provides a useful grounding for examples and applications in the
subsequent chapters, and so we will dip back into it throughout the book.
Part I of the book covers what I term “stability” based results. At a high level, this means that
we ask what can be gained by considering situations where individual observations in a sequence
of random variables X1 , . . . , Xn have little effect on various functions of the sequence. We begin in
Chapter 4 with concentration inequalities, discussing how sums and related quantities can converge
quickly; while this material is essential for the remainder of the chapters, it does not depend on
particular information-theoretic techniques. We discuss some heuristic applications to problems in
statistical learning—empirical risk minimization—in this section of the book, with Chapter 5 pro-
viding results on uniform concentration, with applications to both “generalization”—the standard
theoretical tool in machine learning, most typically applying to the accuracy of prediction models—
and to estimation problems, which provide various guarantees on estimation of model parameters,
which constitute core statistical problems and techniques.
We then turn in Chapter 6 to carefully investigate generalization and convergence guarantees—
arguing that functions of a sample X1 , . . . , Xn are representative of the full population P from
which the sample is drawn—based on controlling different information-theoretic quantities. In this


context, we develop PAC-Bayesian bounds, and we also use the same framework to present tools to
control generalization and convergence in interactive data analyses. These types of analyses reflect
modern statistics, where one performs some type of data exploration before committing to a fuller
analysis, but which breaks classical statistical approaches, because the analysis now depends on
the sample. We provide a treatment of more advanced ideas in Chapter 7, where we develop more
sophisticated concentration results, such as on random matrices, using core ideas from information
theory, which allow us to connect divergence measures to different random processes. Finally, we
provide a chapter (Chapter 8) on disclosure limitation and privacy techniques, all of which repose
on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower
bounds on the possible rates of convergence for various estimation, learning, and other statistical
problems. Chapter 9 kicks things off by developing the three major methods for lower bounds:
the Assouad, Fano, and Le Cam methods. This chapter shows the basic techniques from which all
the other lower bound ideas follow. At a high level, we might consider it, along with Part I, as
exhibiting the entire object of study of this book: how do distributions get close to one another, and
how can we leverage that closeness? We give a brief treatment of some lower bounding techniques
beyond these approaches in Chapter 10, including applications to certain nonparametric problems,
as well as a few results that move beyond the typical lower bounds, which apply in expectation,
to some that mimic “strong converses” in information theory, meaning that with exceedingly high
probability, one cannot hope to achieve anything better than average case error guarantees.
In modern statistical learning problems, one frequently has concerns beyond just statistical risk,
such as communication or computational cost, or the privacy of study participants. Accordingly,
we develop some of the recent techniques for such problems in Chapter 11, which treats problems
where we wish to obtain optimality guarantees simultaneously along many dimensions, connecting
to communication complexity ideas from information theory.
estimation with squared error—the most common error metric—introducing the classical statistical
tools we have, but shows a few of the more modern applications of the ideas, which re-appear with
some frequency. Finally, we conclude the discussion of fundamental limits by looking at testing
problems and functional estimation, where one wishes to only estimate a single parameter of a
larger model (Chapter 13). While estimating a single scalar might seem, a priori, to be simpler
than other problems, adequately addressing its complexity requires a fairly nuanced treatment and
the introduction of careful information-theoretic tools.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of simply
giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (which we cover in Section 2.4.1 of the overview of
information theory), but we also provide an interpretation of entropy and information as measures
of uncertainty in statistical experiments and statistical learning, which is a perspective typically
missing from information-theoretic treatments of entropy (Chapter 14). Our treatment shows a
deep connection between entropy and loss functions used for prediction, where a particular duality
allows moving back and forth between them.
We connect these ideas to the problem of calibration in Chapter 15, where we ask that a
prediction model be valid in the sense that, e.g., on 75% of the days for which the model predicts a
75% chance of rain, it rains. We are also able to use these information-theoretic notions of risk, entropy, and losses
to connect to problems in optimization and machine learning. In particular, Chapter 16 explores the
ways that, if instead of fitting a model to some “true” loss we use an easier-to-optimize surrogate,
we essentially lose nothing. This allows us to delineate when (at least in asymptotic senses) it


is possible to computationally efficiently learn good predictors and design good experiments in
statistical machine learning problems. Because of the connections with optimization and convex
duality, these chapters repose on a nontrivial foundation of convex analysis; we include Appendices
(Appendix B and C) that provide a fairly comprehensive review of the results we require. For
readers unfamiliar with convex optimization and analysis, I will be the first to admit that these
chapters may be tough going—accordingly, we attempt to delineate the big-picture ideas from the
nitty-gritty technical conditions necessary for the most general results.
Part IV finishes the book with a treatment of stochastic optimization, online game playing,
and minimax problems. Our approach in Chapter 17 takes a modern perspective on stochastic
optimization as minimizing random models of functions, and it includes the “book” proofs of
convergence of the workhorses of modern machine learning optimization. It also leverages the
earlier results on fundamental limits to develop optimality theory for convex optimization in the
same framework. Chapter 18 explores online decision-making problems and, more broadly, problems
that require exploration and exploitation. This includes bandit problems and some basic questions
in causal estimation, where information-theoretic tools allow a clean treatment. The concluding
Chapter 19 revisits Chapter 14 on loss functions and predictions, but considers it more in the
context of particular games between nature and a statistician/learner. Once again leveraging the
perspective on entropy and loss functions we have developed, we are able to provide a generalization
of the celebrated redundancy/capacity theorem from information theory, but recast as a game of
loss minimization against nature.

1.4 A remark about measure theory


As this book focuses on a number of fundamental questions in statistics, machine learning, and
information theory, fully general statements of the results often require measure theory. Thus,
formulae such as $\int f(x)\,dP(x)$ or $\int f(x)\,d\mu(x)$ appear. While knowledge of measure theory is cer-
tainly useful and may help appreciate the results, it is completely inessential to developing the
intuition and, I hope, understanding the proofs and main results. Indeed, the best strategy (for
a reader unfamiliar with measure theory) is to simply replace every instance of a formula such as
dµ(x) with dx. The most frequent cases we encounter will be the following: we wish to compute
the expectation of a function $f$ of a random variable $X$ following distribution $P$, that is, $\mathbb{E}_P[f(X)]$.
Normally, we would write $\mathbb{E}_P[f(X)] = \int f(x)\,dP(x)$, or sometimes $\mathbb{E}_P[f(X)] = \int f(x)p(x)\,d\mu(x)$,
saying that "$P$ has density $p$ with respect to the underlying measure $\mu$." Instead, one may simply
(and intuitively) assume that $x$ really has density $p$ over the reals, and instead of computing the
integral
$$\mathbb{E}_P[f(X)] = \int f(x)\,dP(x) \quad \text{or} \quad \mathbb{E}_P[f(X)] = \int f(x)p(x)\,d\mu(x),$$
assume we may write
$$\mathbb{E}_P[f(X)] = \int f(x)p(x)\,dx.$$

Nothing will be lost.
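
As a sanity check on this recipe, a small Python sketch (assuming NumPy and SciPy are available; the test function $f = \cos$ is an arbitrary choice of ours) computes $\mathbb{E}_P[f(X)]$ for a standard normal both by integrating $f(x)p(x)$ over the reals and by sampling from $P$:

```python
import numpy as np
from scipy import integrate, stats

f = np.cos                                   # any bounded test function
p = stats.norm(loc=0.0, scale=1.0)           # P = N(0, 1), with density p(x)

# E_P[f(X)] via the density: integral of f(x) p(x) dx over the reals.
val, _ = integrate.quad(lambda x: f(x) * p.pdf(x), -np.inf, np.inf)

# The same expectation via Monte Carlo sampling from P.
samples = p.rvs(size=200_000, random_state=0)
print(val, f(samples).mean())                # both approx exp(-1/2) ~ 0.6065
```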

Chapter 2

An information theory review

In this first introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.

2.1 Basics of Information Theory


In this section, we review the basic definitions in information theory, including (Shannon) entropy,
KL-divergence, mutual information, and their conditional versions. Before beginning, I must make
an apology to any information theorist reading these notes: any time we use a log, it will always
be base-e. This is more convenient for our analyses, and it also (later) makes taking derivatives
much nicer.
In this first section, we will assume that all distributions are discrete; this makes the quantities
somewhat easier to manipulate and allows us to completely avoid any complicated measure-theoretic
quantities. In Section 2.2 of this note, we show how to extend the important definitions (for our
purposes)—those of KL-divergence and mutual information—to general distributions, where basic
ideas such as entropy no longer make sense. However, even in this general setting, we will see we
essentially lose no generality by assuming all variables are discrete.

2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.


Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P ) is defined as
$$H(X) := -\sum_x p(x) \log p(x).$$

Because $p(x) \le 1$ for all $x$, it is clear that this quantity is nonnegative. We will show later that if
the set $\mathcal{X}$ is finite, the maximum entropy distribution on $\mathcal{X}$ is the uniform distribution, setting
$p(x) = 1/|\mathcal{X}|$ for all $x$, which has entropy $\log(|\mathcal{X}|)$.
Later in the book, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon's classical information the-
ory [167]—is via the source-coding theorem. We present Shannon's source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to P , with a k-ary string (i.e., each entry of the string takes on one of k values), then the minimal
expected length of the encoding is given by $H(X) = -\sum_x p(x) \log_k p(x)$. Moreover, this is achiev-
able (to within a length of at most 1 symbol) by using Huffman codes (among many other types of
codes). As an example of this interpretation, we may consider encoding a random variable X with
equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we
simply assign an integer to each item and encode each integer with the natural (binary) integer
encoding of length $\lceil \log_2 m \rceil$.
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
$$H(X \mid Y = y) = -\sum_x p(x \mid y) \log p(x \mid y) \qquad \text{and} \qquad H(X \mid Y) = \sum_y p(y) H(X \mid Y = y),$$

where p(x | y) is the p.m.f. of X given that Y = y.
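
To make these definitions concrete, here is a minimal Python sketch (assuming NumPy; the joint p.m.f. below is an arbitrary example of ours, not from the text) computing $H(X)$ and $H(X \mid Y)$, and checking that the uniform distribution on $m$ items has entropy $\log m$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p log p in nats, skipping zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy(np.ones(8) / 8), np.log(8))     # uniform on m = 8 items: H = log 8

# Conditional entropy from a joint p.m.f. p(x, y): H(X|Y) = sum_y p(y) H(X|Y=y).
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])                   # rows index x, columns index y
py = pxy.sum(axis=0)                           # marginal of Y
H_X_given_Y = sum(py[j] * entropy(pxy[:, j] / py[j]) for j in range(len(py)))
print(entropy(pxy.sum(axis=1)), H_X_given_Y)   # H(X) >= H(X|Y)
```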


Let us now provide a few examples of the entropy of various discrete random variables.
Example 2.1.1 (Uniform random variables): As we noted earlier, if a random variable X is
uniform on a set of size m, then H(X) = log m. 3

Example 2.1.2 (Bernoulli random variables): Let $h_2(p) = -p \log p - (1-p) \log(1-p)$ denote
the binary entropy, which is the entropy of a Bernoulli(p) random variable. 3

Example 2.1.3 (Geometric random variables): A random variable X is Geometric(p), for


some $p \in [0, 1]$, if it is supported on $\{1, 2, \ldots\}$, and $P(X = k) = (1-p)^{k-1} p$; this is the
probability distribution of the number X of Bernoulli(p) trials until a single success. The
entropy of such a random variable is
$$H(X) = -\sum_{k=1}^{\infty} (1-p)^{k-1} p \left[(k-1) \log(1-p) + \log p\right] = -\sum_{k=0}^{\infty} (1-p)^k p \left[k \log(1-p) + \log p\right].$$
As $\sum_{k=0}^{\infty} \alpha^k = \frac{1}{1-\alpha}$ and $\frac{d}{d\alpha} \frac{1}{1-\alpha} = \frac{1}{(1-\alpha)^2} = \sum_{k=1}^{\infty} k \alpha^{k-1}$, we have
$$H(X) = -p \log(1-p) \sum_{k=1}^{\infty} k (1-p)^k - p \log p \sum_{k=0}^{\infty} (1-p)^k = -\frac{1-p}{p} \log(1-p) - \log p.$$

As p ↓ 0, we see that H(X) ↑ ∞. 3
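
One can check the closed form numerically; the short Python sketch below (ours, with a truncated sum standing in for the infinite series) matches $-\frac{1-p}{p}\log(1-p) - \log p$:

```python
import numpy as np

p = 0.3
k = np.arange(1, 10_000)                 # truncate the infinite support
pmf = (1 - p) ** (k - 1) * p             # P(X = k) for a Geometric(p)
H_direct = -np.sum(pmf * np.log(pmf))    # entropy of the (truncated) p.m.f.
H_closed = -(1 - p) / p * np.log(1 - p) - np.log(p)
print(H_direct, H_closed)                # agree to high precision
```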


Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by

$$p(k) = \frac{A}{k \log^2 k} \quad \text{where} \quad A^{-1} = \sum_{k=2}^{\infty} \frac{1}{k \log^2 k} < \infty,$$
the last sum finite as $\int_2^{\infty} \frac{1}{x \log^\alpha x}\,dx < \infty$ if and only if $\alpha > 1$: for $\alpha = 1$, we have $\int_e^x \frac{dt}{t \log t} = \log \log x$, while for $\alpha > 1$, we have
$$\frac{d}{dx} (\log x)^{1-\alpha} = (1-\alpha) \frac{1}{x \log^\alpha x},$$
so that $\int_e^{\infty} \frac{dt}{t \log^\alpha t} = \frac{1}{\alpha - 1}$. To see that the entropy is infinite, note that
$$H(X) = A \sum_{k \ge 2} \frac{\log k + 2 \log \log k - \log A}{k \log^2 k} \ge A \sum_{k \ge 2} \frac{\log k}{k \log^2 k} - C = \infty,$$

where C is a numerical constant. 3

KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function f : Rk → R as any
bowl-shaped function, that is, one satisfying
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) (2.1.1)
for all λ ∈ [0, 1], all x, y. The function f is strictly convex if the convexity inequality (2.1.1) is
strict for λ ∈ (0, 1) and x ̸= y. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let f be convex. Then for any random variable X,
f (E[X]) ≤ E[f (X)].
Moreover, if f is strictly convex, then f (E[X]) < E[f (X)] unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is

D_kl(P||Q) := ∑_{x∈X} p(x) log( p(x)/q(x) ).

We observe immediately that D_kl(P||Q) ≥ 0. To see this, we apply Jensen's inequality (Proposition 2.1.5) to the function −log and the random variable q(X)/p(X), where X is distributed
according to P :

D_kl(P||Q) = −E[ log( q(X)/p(X) ) ] ≥ −log E[ q(X)/p(X) ] = −log ∑_x p(x) (q(x)/p(x)) = −log(1) = 0.


Moreover, as −log is strictly convex, we have D_kl(P||Q) > 0 unless P = Q. Another consequence of
the positivity of the KL-divergence is that whenever the set X is finite with cardinality |X | < ∞,
for any random variable X supported on X we have H(X) ≤ log |X |. Indeed, letting m = |X |, Q
be the uniform distribution on X so that q(x) = 1/m, and X have distribution P on X , we have

0 ≤ D_kl(P||Q) = ∑_x p(x) log( p(x)/q(x) ) = −H(X) − ∑_x p(x) log q(x) = −H(X) + log m,   (2.1.2)

so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions
on the set X .

Mutual information: Having defined KL-divergence, we may now describe the information
content shared between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and the product of their marginal distributions.
More mathematically,

I(X; Y ) := ∑_{x,y} p(x, y) log( p(x, y)/(p(x)p(y)) ).   (2.1.3)

We can rewrite this in several ways. First, using Bayes' rule, we have p(x, y)/p(y) = p(x | y), so

I(X; Y ) = ∑_{x,y} p(y) p(x | y) log( p(x | y)/p(x) )
         = −∑_x p(x) log p(x) + ∑_y p(y) ∑_x p(x | y) log p(x | y)
         = H(X) − H(X | Y ).

Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
Comparing the definition (2.1.3) to that for KL-divergence, we see that if PXY is the joint
distribution of X and Y , while PX and PY are their marginal distributions (distributions when X
and Y are treated independently), then

I(X; Y ) = Dkl (PXY ||PX × PY ) ≥ 0.

Moreover, we have I(X; Y ) > 0 unless X and Y are independent.


As with entropy, we may also define the conditional information between X and Y given Z,
which is the mutual information between X and Y when Z is observed (on average). That is,
I(X; Y | Z) := ∑_z I(X; Y | Z = z) p(z) = H(X | Z) − H(X | Y, Z) = H(Y | Z) − H(Y | X, Z).
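As a quick illustration (an editorial addition; the joint p.m.f. below is a made-up example), the identities relating mutual information, entropy, and KL divergence can be checked numerically: the definition (2.1.3) and the entropy difference H(X) − H(X | Y ) produce the same number.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

pxy = np.array([[0.20, 0.10, 0.15],   # joint pmf p(x, y), rows index x
                [0.05, 0.30, 0.20]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_def = np.sum(pxy * np.log(pxy / np.outer(px, py)))  # definition (2.1.3)
I_ent = H(px) - (H(pxy) - H(py))                      # H(X) - H(X | Y)
print(I_def, I_ent)                                    # identical values
```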

Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable X with
density p is defined by

h(X) := −∫ p(x) log p(x) dx.   (2.1.4)


Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
h(X | Y ) = −∫ p(y) ∫ p(x | y) log p(x | y) dx dy.

We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is
h(X | Y ) = E[ −∫ p(x | Y ) log p(x | Y ) dx ],

where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes

D_kl(P||Q) = ∫ p(x) log( p(x)/q(x) ) dx,

and similarly, we have the analogue of mutual information as

I(X; Y ) = ∫ p(x, y) log( p(x, y)/(p(x)p(y)) ) dx dy = h(X) − h(X | Y ) = h(Y ) − h(Y | X).
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on an interval [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
a normal random variable is straightforward to compute. Indeed, for X ∼ N(µ, σ²) we have
p(x) = (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)), so that

h(X) = −∫ p(x) [ −½ log(2πσ²) − (x − µ)²/(2σ²) ] dx = ½ log(2πσ²) + E[(X − µ)²]/(2σ²) = ½ log(2πeσ²).

For a general multivariate Gaussian, where X ∼ N(µ, Σ) for a vector µ ∈ Rⁿ and Σ ≻ 0 with
density p(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−½(x − µ)ᵀΣ⁻¹(x − µ)), we similarly have

h(X) = ½ E[ n log(2π) + log det(Σ) + (X − µ)ᵀΣ⁻¹(X − µ) ]
     = (n/2) log(2π) + ½ log det(Σ) + ½ tr(Σ⁻¹Σ) = (n/2) log(2πe) + ½ log det(Σ).
3
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let P be the multivariate
normal N(µ1 , Σ), and Q be the multivariate normal distribution with mean µ2 and identical
covariance Σ ≻ 0. Then we have that
D_kl(P||Q) = ½ (µ₁ − µ₂)ᵀ Σ⁻¹ (µ₁ − µ₂).   (2.1.5)
We leave the computation of the identity (2.1.5) to the reader. 3
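The following Monte Carlo sketch (an editorial addition with illustrative parameters) checks both the entropy formula of Example 2.1.6 and the divergence identity (2.1.5) in the scalar case.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma = 1.0, -0.5, 2.0
x = rng.normal(mu1, sigma, size=10**6)

def log_p(x, mu):  # log density of N(mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# E[-log p(X)] vs the closed form (1/2) log(2 pi e sigma^2)
print(-log_p(x, mu1).mean(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
# Monte Carlo KL vs identity (2.1.5)
print((log_p(x, mu1) - log_p(x, mu2)).mean(), (mu1 - mu2)**2 / (2 * sigma**2))
```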


An interesting consequence of Example 2.1.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that X has mean 0. Let P be the distribution of
X with density p, and let Q be multivariate normal with mean 0 and covariance Σ; let Z be this
random variable. Then
D_kl(P||Q) = ∫ p(x) log( p(x)/q(x) ) dx = −h(X) + ∫ p(x) ( (n/2) log(2π) + ½ log det(Σ) + ½ xᵀΣ⁻¹x ) dx
           = −h(X) + h(Z),
because Z has the same covariance as X. As 0 ≤ Dkl (P ||Q), we have h(Z) ≥ h(X) as desired.

We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [143].

2.1.2 Chain rules and related properties


We now illustrate several of the properties of entropy, KL divergence, and mutual information;
these allow easier calculations and analysis.

Chain rules: We begin by describing relationships between collections of random variables


X₁, . . . , Xₙ and individual members of the collection. (Throughout, we use the notation X_i^j =
(X_i, X_{i+1}, . . . , X_j) to denote the sequence of random variables from indices i through j.)
For the entropy, we have the simplest chain rule:

H(X₁, . . . , Xₙ) = H(X₁) + H(X₂ | X₁) + · · · + H(Xₙ | X_1^{n−1}).
This follows from the standard decomposition of a probability distribution p(x, y) = p(x)p(y | x).
To see the chain rule, then, note that

H(X, Y ) = −∑_{x,y} p(x)p(y | x) log[ p(x)p(y | x) ]
         = −∑_x p(x) ∑_y p(y | x) log p(x) − ∑_x p(x) ∑_y p(y | x) log p(y | x) = H(X) + H(Y | X).

Now set X = X_1^{n−1}, Y = Xₙ, and simply induct.


A related corollary of the definitions of mutual information is the well-known result that con-
ditioning reduces entropy:
H(X | Y ) ≤ H(X) because I(X; Y ) = H(X) − H(X | Y ) ≥ 0.
So on average, knowing about a variable Y can only decrease your uncertainty about X. That
conditioning reduces entropy for continuous random variables is also immediate, as for X continuous
we have I(X; Y ) = h(X) − h(X | Y ) ≥ 0, so that h(X) ≥ h(X | Y ).


Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
I(X; Y_1^n) = ∑_{i=1}^n I(X; Y_i | Y_1^{i−1}).

Indeed, we have

I(X; Y_1^n) = H(Y_1^n) − H(Y_1^n | X) = ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X, Y_1^{i−1}) ] = ∑_{i=1}^n I(X; Y_i | Y_1^{i−1}).

The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that P = P₁ × P₂ × · · · × Pₙ and Q = Q₁ × · · · × Qₙ,
that is, that P and Q are product distributions over independent random variables X_i ∼ P_i or
X_i ∼ Q_i. Then we immediately have the tensorization identity

D_kl(P||Q) = D_kl(P₁ × · · · × Pₙ || Q₁ × · · · × Qₙ) = ∑_{i=1}^n D_kl(P_i||Q_i).

We remark in passing that these two identities hold for arbitrary distributions P_i and Q_i or random
variables X, Y . As a final tensorization identity, we consider a more general chain rule for KL-divergences,
which will frequently be useful. We abuse notation temporarily, and for random
variables X and Y with distributions P and Q, respectively, we denote

D_kl(X||Y) := D_kl(P||Q).

In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions P_{X|z} and P_{Y|z} conditioned on Z = z, respectively. Then we define

D_kl(X||Y | Z) := E_Z[ D_kl(P_{X|Z} || P_{Y|Z}) ],

so that if Z is discrete we have D_kl(X||Y | Z) = ∑_z p(z) D_kl(P_{X|z} || P_{Y|z}). With this notation, we
have the chain rule

D_kl(X₁, . . . , Xₙ || Y₁, . . . , Yₙ) = ∑_{i=1}^n D_kl(X_i || Y_i | X_1^{i−1}),   (2.1.6)

because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions P_XY and Q_XY we have

D_kl(P_XY || Q_XY) = ∑_{x,y} p(x, y) log( p(x, y)/q(x, y) ) = ∑_{x,y} p(x)p(y | x) [ log( p(y | x)/q(y | x) ) + log( p(x)/q(x) ) ]
                   = ∑_x p(x) log( p(x)/q(x) ) + ∑_x p(x) ∑_y p(y | x) log( p(y | x)/q(y | x) ),

where the final equality uses that ∑_y p(y | x) = 1 for all x. In different notation, if we let P and
Q be any distributions on X₁ × · · · × Xₙ, and define P_i(A | x_1^{i−1}) = P(X_i ∈ A | X_1^{i−1} = x_1^{i−1}), and
similarly for Q_i, we have the following:


Lemma 2.1.9. Let P, Q be distributions on X₁ × · · · × Xₙ. Then

D_kl(P||Q) = ∑_{i=1}^n E_P[ D_kl( P_i(· | X_1^{i−1}) || Q_i(· | X_1^{i−1}) ) ].

Expanding upon this, we give several tensorization identities, showing how to transform questions
about the joint distribution of many random variables into simpler questions about their
marginals. As a first example, because conditioning decreases entropy, for any sequence of (discrete
or continuous, as appropriate) random variables we have

H(X₁, . . . , Xₙ) ≤ H(X₁) + · · · + H(Xₙ)   and   h(X₁, . . . , Xₙ) ≤ h(X₁) + · · · + h(Xₙ).

Both inequalities hold with equality if and only if X₁, . . . , Xₙ are mutually independent. (The only
if follows because I(X; Y ) > 0 whenever X and Y are not independent, by Jensen's inequality and
the fact that D_kl(P||Q) > 0 unless P = Q.)
We return to information and divergence now. Suppose that random variables Yi are indepen-
dent conditional on X, meaning that

P (Y1 = y1 , . . . , Yn = yn | X = x) = P (Y1 = y1 | X = x) · · · P (Yn = yn | X = x).

Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality

I(X; Y₁, . . . , Yₙ) = ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X, Y_1^{i−1}) ]
                    = ∑_{i=1}^n [ H(Y_i | Y_1^{i−1}) − H(Y_i | X) ] ≤ ∑_{i=1}^n [ H(Y_i) − H(Y_i | X) ] = ∑_{i=1}^n I(X; Y_i),   (2.1.7)

where the second equality uses the conditional independence of the Y_i given X, and the inequality
follows because conditioning reduces entropy.
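A brief numerical sketch of the tensorization identity (an editorial addition; the Dirichlet-sampled p.m.f.s are arbitrary): for product distributions, the KL divergence of the joint equals the sum of the coordinatewise divergences.

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(1)
Ps = [rng.dirichlet(np.ones(4)) for _ in range(3)]  # P = P1 x P2 x P3
Qs = [rng.dirichlet(np.ones(4)) for _ in range(3)]  # Q = Q1 x Q2 x Q3

P_joint = np.einsum('i,j,k->ijk', *Ps)  # joint pmf on the product space
Q_joint = np.einsum('i,j,k->ijk', *Qs)
print(kl(P_joint, Q_joint), sum(kl(p, q) for p, q in zip(Ps, Qs)))
```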

2.1.3 Data processing inequalities:


A standard problem in information theory (and statistical inference) is to understand the degrada-
tion of a signal after it is passed through some noisy channel (or observation process). The simplest
of such results, which we will use frequently, is that we can only lose information by adding noise.
In particular, assume we have the Markov chain

X → Y → Z.

Then we obtain the classical data processing inequality.

Proposition 2.1.10. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).

Proof We expand the mutual information I(X; Y, Z) in two ways:

I(X; Y, Z) = I(X; Z) + I(X; Y | Z)
           = I(X; Y ) + I(X; Z | Y ),


where we note that the final term vanishes, I(X; Z | Y ) = 0, because X is independent of Z given Y :

I(X; Z | Y ) = H(X | Y ) − H(X | Y, Z) = H(X | Y ) − H(X | Y ) = 0.

Since I(X; Y | Z) ≥ 0, this gives the result.

There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we
let P₁ and P₂ be distributions on X and Q₁ and Q₂ be the induced distributions on Z, that is,
Q_i(A) = ∫ P(Z ∈ A | X = x) dP_i(x), then we have

D_kl(Q₁||Q₂) ≤ D_kl(P₁||P₂),

the basic KL-divergence data processing inequality. A consequence of this is that, for any function
f and random variables X and Y on the same space, we have

Dkl (f (X)||f (Y )) ≤ Dkl (X||Y ) .

We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
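The inequality D_kl(Q₁||Q₂) ≤ D_kl(P₁||P₂) is easy to observe numerically; the sketch below (an editorial addition; the kernel and distributions are random examples) pushes two distributions through a common channel K and compares divergences.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
P1, P2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
K = rng.dirichlet(np.ones(3), size=5)  # K[x, z] = P(Z = z | X = x)

Q1, Q2 = P1 @ K, P2 @ K                # induced distributions of Z
print(kl(Q1, Q2), "<=", kl(P1, P2))    # data processing inequality
```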

2.2 General divergence measures and definitions


Having given our basic definitions of mutual information and divergence, we now show how the
definitions of KL-divergence and mutual information extend to arbitrary distributions P and Q
and arbitrary sets X . This requires a bit of setup, including defining set algebras (which, we will
see, simply correspond to quantization of the set X ), but allows us to define divergences in full
generality.

2.2.1 Partitions, algebras, and quantizers


Let X be an arbitrary space. A quantizer on X is any function that maps X to a finite collection
of integers. That is, fixing m < ∞, a quantizer is any function q : X → {1, . . . , m}. In particular,
a quantizer q partitions the space X into the subsets of x ∈ X for which q(x) = i. A related
notion—we will see the precise relationship presently—is that of an algebra of sets on X . We say
that a collection of sets A is an algebra on X if the following are true:

1. The set X ∈ A.

2. The collection of sets A is closed under finite set operations: union, intersection, and com-
plementation. That is, A, B ∈ A implies that Ac ∈ A, A ∩ B ∈ A, and A ∪ B ∈ A.

There is a 1-to-1 correspondence between quantizers—and their associated partitions of the set
X —and finite algebras on a set X , which we discuss briefly.1 It should be clear that there is a
one-to-one correspondence between finite partitions of the set X and quantizers q, so we must argue
that finite partitions of X are in one-to-one correspondence with finite algebras defined over X .
¹ Pedantically, this one-to-one correspondence holds up to permutations of the partition induced by the quantizer.


In one direction, we may consider a quantizer q : X → {1, . . . , m}. Let the sets A1 , . . . , Am
be the partition associated with q, that is, for x ∈ Ai we have q(x) = i, or Ai = q−1 ({i}). Then
we may define an algebra Aq as the collection of all finite set operations performed on A1 , . . . , Am
(note that this is a finite collection, as finite set operations performed on the partition A1 , . . . , Am
induce only a finite collection of sets).
For the other direction, consider a finite algebra A over the set X . We can then construct a
quantizer qA that corresponds to this algebra. To do so, we define an atom of A as any non-empty
set A ∈ A such that if B ⊂ A and B ∈ A, then B = A or B = ∅. That is, the atoms of A are the
“smallest” sets in A. We claim there is a unique partition of X with atomic sets from A; we prove
this inductively.

Base case: There is at least 1 atomic set, as A is finite; call it A1 .

Induction step: Assume we have atomic sets A1 , . . . , Ak ∈ A. Let B = (A1 ∪ · · · ∪ Ak )c be their


complement, which we assume is non-empty (otherwise we have a partition of X into atomic sets).
The complement B is either atomic, in which case the sets {A1 , A2 , . . . , Ak , B} are a partition of
X consisting of atoms of A, or B is not atomic. If B is not atomic, consider all the sets of the form
A ∩ B for A ∈ A. Each of these belongs to A, and at least one of them is atomic, as there is a
finite number of them. This means there is a non-empty set Ak+1 ⊂ B such that Ak+1 is atomic.
By repeating this induction, which must stop at some finite index m as A is finite, we construct
a collection A₁, . . . , Aₘ of disjoint atomic sets in A for which ∪_i A_i = X . (The uniqueness is
an exercise for the reader.) Thus we may define the quantizer qA via

qA (x) = i when x ∈ Ai .

2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
Q conditioned on A is

D_kl(P||Q | A) := ∑_{A ∈ atoms(A)} P(A) log( P(A)/Q(A) ).

That is, we simply sum over the partition of X . Another way to write this is as follows. Let
q : X → {1, . . . , m} be a quantizer, and define the sets Ai = q−1 ({i}) to be the pre-images of each
i (i.e. the different quantization regions, or the partition of X that q induces). Then the quantized
KL-divergence between P and Q is

D_kl(P||Q | q) := ∑_{i=1}^m P(A_i) log( P(A_i)/Q(A_i) ).

We may now give the fully general definition of KL-divergence: the KL-divergence between P
and Q is defined as

D_kl(P||Q) := sup { D_kl(P||Q | A) such that A is a finite algebra on X }
            = sup { D_kl(P||Q | q) such that q quantizes X }.   (2.2.1)


This also gives a rigorous definition of mutual information. Indeed, if X and Y are random variables
with joint distribution PXY and marginal distributions PX and PY , we simply define
I(X; Y ) = Dkl (PXY ||PX × PY ) .
When P and Q have densities p and q, the definition (2.2.1) reduces to

D_kl(P||Q) = ∫ p(x) log( p(x)/q(x) ) dx,

while if P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the
definition (2.2.1) is equivalent to

D_kl(P||Q) = ∑_x p(x) log( p(x)/q(x) ),

precisely as in the discrete case.
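The supremum in definition (2.2.1) is approached by refining the quantizer. The sketch below (an editorial addition; the choice of two unit-variance Gaussians, whose KL divergence is exactly 1/2, is for illustration) quantizes the real line into nested partitions and watches the quantized divergence increase toward the continuous value.

```python
import numpy as np
from scipy.stats import norm

for m in [2, 4, 16, 64, 256]:
    edges = norm.ppf(np.linspace(0, 1, m + 1))  # nested partitions of R
    P = np.diff(norm.cdf(edges, loc=0))         # P(A_i) under N(0, 1)
    Q = np.diff(norm.cdf(edges, loc=1))         # Q(A_i) under N(1, 1)
    print(m, np.sum(P * np.log(P / Q)))         # increases toward 1/2
```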


We remark in passing that if the set X is a product space, meaning that X = X1 × X2 × · · · × Xn
for some n < ∞ (this is the case for mutual information, for example), then we may assume our
quantizer always quantizes sets of the form A = A1 × A2 × · · · × An , that is, Cartesian products.
Written differently, when we consider algebras on X , the atoms of the algebra may be assumed to be
Cartesian products of sets, and our partitions of X can always be taken as Cartesian products. (See
Gray [104, Chapter 5].) Written slightly differently, if P and Q are distributions on X = X₁ × · · · × Xₙ
and q_i is a quantizer for the set X_i (inducing the partition A^i_1, . . . , A^i_{m_i} of X_i) we may define

D_kl(P||Q | q₁, . . . , qₙ) = ∑_{j₁,...,jₙ} P(A¹_{j₁} × A²_{j₂} × · · · × Aⁿ_{jₙ}) log( P(A¹_{j₁} × A²_{j₂} × · · · × Aⁿ_{jₙ}) / Q(A¹_{j₁} × A²_{j₂} × · · · × Aⁿ_{jₙ}) ).

Then the general definition (2.2.1) of KL-divergence specializes to

D_kl(P||Q) = sup { D_kl(P||Q | q₁, . . . , qₙ) such that q_i quantizes X_i }.
So we only need consider “rectangular” sets in the definitions of KL-divergence.

Measure-theoretic definition of KL-divergence If you have never seen measure theory be-
fore, skim this section; while the notation may be somewhat intimidating, it is fine to always
consider only continuous or fully discrete distributions. We will describe an interpretation that will
mean for our purposes that one never needs to really think about measure theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
D_kl(P||Q) = ∫_X p(x) log( p(x)/q(x) ) dµ(x).   (2.2.2)
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [104, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be
as follows. When integrating a function f(x), replace ∫ f(x) dµ(x) with one of two interpretations:
one may simply think of dµ(x) as dx, so that we are performing standard integration ∫ f(x) dx, or
one should think of the integral operation ∫ f(x) dµ(x) as summing the argument of the integral,
so that ∫ f(x) dµ(x) = ∑_x f(x), with each point receiving measure 1. (This corresponds to µ being
“counting measure” on X .)


2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [6, 59]
(see also the alternate interpretations in the article by Liese and Vajda [137]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a
convex function satisfying f (1) = 0. If X is a discrete set, then the f -divergence between P and Q
is

D_f(P||Q) := ∑_x q(x) f( p(x)/q(x) ).

More generally, for any set X and a quantizer q : X → {1, . . . , m}, letting A_i = q⁻¹({i}) = {x ∈
X | q(x) = i} be the partition the quantizer induces, we can define the quantized divergence

D_f(P||Q | q) = ∑_{i=1}^m Q(A_i) f( P(A_i)/Q(A_i) ),

and the general definition of an f divergence is (in analogy with the definition (2.2.1) of general
KL divergences)

Df (P ||Q) := sup {Df (P ||Q | q) such that q quantizes X } . (2.2.3)

The definition (2.2.3) shows that, any time we have computations involving f -divergences—such
as KL-divergence or mutual information—it is no loss of generality, when performing the compu-
tations, to assume that all distributions have finite discrete support. There is a measure-theoretic
version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are
absolutely continuous with respect to the base measure µ. The f divergence between P and Q is
then

D_f(P||Q) := ∫_X q(x) f( p(x)/q(x) ) dµ(x).   (2.2.4)
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of
a convex function f is cl f (x) = sup{ℓ(x) | ℓ ≤ f, ℓ linear}, the supremum over all linear functions
that globally lower bound f . Then [111, Proposition IV.2.2.2] the closure of pers(f) is defined, for
any t′ ∈ int dom f , by

cl pers(f)(t, u) = { u f(t/u)                      if u > 0,
                   { lim_{α↓0} α f(t′ − t + t/α)   if u = 0,
                   { +∞                            if u < 0.

(The choice of t′ does not affect the definition.) Then the fully general formula expressing the
f -divergence is

D_f(P||Q) = ∫_X cl pers(f)(p(x), q(x)) dµ(x).   (2.2.5)
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based


on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f -divergences from additional perspectives later—they are impor-
tant both for optimality in estimation and related to consistency and prediction problems, as we
discuss in Chapter 16.5.

Examples We give several examples of f -divergences here; in Section 9.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.

Example 2.2.1 (KL-divergence): By taking f (t) = t log t, which is convex and satisfies
f (1) = 0, we obtain Df (P ||Q) = Dkl (P ||Q). 3

Example 2.2.2 (KL-divergence, reversed): By taking f (t) = − log t, we obtain Df (P ||Q) =


Dkl (Q||P ). 3

Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions P and Q defined on a set X is the maximum difference between probabilities they
assign on subsets of X :
∥P − Q∥_TV := sup_{A⊂X} |P(A) − Q(A)| = sup_{A⊂X} (P(A) − Q(A)),   (2.2.6)

where the second equality follows by considering complements P(Aᶜ) = 1 − P(A). The total
variation distance, as we shall see later, is important for verifying the optimality of different
tests, and appears in the measurement of difficulty of solving hypothesis testing problems. With the
choice f(t) = ½|t − 1|, we obtain the total variation distance, that is, ∥P − Q∥_TV = D_f(P||Q).
There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will
be useful in the sequel when we develop inequalities relating the divergences. 3

Lemma 2.2.4. Let P, Q be probability measures with densities p, q with respect to a base measure
µ and f(t) = ½|t − 1|. Then

∥P − Q∥_TV = D_f(P||Q) = ½ ∫ |p(x) − q(x)| dµ(x)
           = ∫ [p(x) − q(x)]₊ dµ(x) = ∫ [q(x) − p(x)]₊ dµ(x)
           = P(dP/dQ > 1) − Q(dP/dQ > 1) = Q(dQ/dP > 1) − P(dQ/dP > 1).


In particular, the set A = {x | p(x)/q(x) ≥ 1} maximizes P (B)−Q(B) over B ⊂ X and so achieves
∥P − Q∥TV = P (A) − Q(A).
Proof Eliding the measure-theoretic details,² we immediately have

D_f(P||Q) = ½ ∫ | p(x)/q(x) − 1 | q(x) dµ(x) = ½ ∫ |p(x) − q(x)| dµ(x)
          = ½ ∫_{x:p(x)>q(x)} [p(x) − q(x)] dµ(x) + ½ ∫_{x:q(x)>p(x)} [q(x) − p(x)] dµ(x)
          = ½ ∫ [p(x) − q(x)]₊ dµ(x) + ½ ∫ [q(x) − p(x)]₊ dµ(x).

² To make this fully rigorous, we would use the Hahn decomposition of the signed measure P − Q to recognize that
∫ f (dP − dQ) = ∫ f [dP − dQ]₊ − ∫ f [dQ − dP]₊ for any integrable f .


Considering the last integral ∫ [q(x) − p(x)]₊ dµ(x), we see that the set A = {x : q(x) > p(x)}
satisfies

Q(A) − P(A) = ∫_A (q(x) − p(x)) dµ(x) ≥ ∫_B (q(x) − p(x)) dµ(x) = Q(B) − P(B)

for any set B, as any x ∈ B \ A clearly satisfies q(x) − p(x) ≤ 0.

Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions
P and Q defined on a set X is generated by the function f(t) = ½(√t − 1)² = ½(t − 2√t + 1).
The Hellinger distance is then

d²_hel(P, Q) := ½ ∫ ( √p(x) − √q(x) )² dµ(x).   (2.2.7)

The non-squared version d_hel(P, Q) is indeed a distance between probability measures P and
Q. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between
P and Q, as

d²_hel(P, Q) = ½ ∫ ( p(x) + q(x) − 2√(p(x)q(x)) ) dµ(x) = 1 − ∫ √(p(x)q(x)) dµ(x),   (2.2.8)

which makes clear that d_hel(P, Q) ∈ [0, 1] is on roughly the same scale as the variation distance;
we will say more later. 3

Example 2.2.6 (χ² divergence): The χ²-divergence is generated by taking f(t) = (t − 1)²,
so that

D_χ²(P||Q) := ∫ ( p(x)/q(x) − 1 )² q(x) dµ(x) = ∫ p(x)²/q(x) dµ(x) − 1,   (2.2.9)

where the equality is immediate because ∫ p dµ = ∫ q dµ = 1. 3
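Each of the preceding examples can be computed directly from its generator f ; the snippet below (an editorial addition evaluated on two made-up p.m.f.s) does so on a finite space.

```python
import numpy as np

def f_div(p, q, f):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) for strictly positive pmfs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

p, q = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.5, 0.3])
print(f_div(p, q, lambda t: t * np.log(t)))              # KL(P||Q)
print(f_div(p, q, lambda t: -np.log(t)))                 # KL(Q||P)
print(f_div(p, q, lambda t: 0.5 * np.abs(t - 1)))        # variation distance
print(f_div(p, q, lambda t: 0.5 * (np.sqrt(t) - 1)**2))  # squared Hellinger
print(f_div(p, q, lambda t: (t - 1)**2))                 # chi-squared
```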

2.2.4 Inequalities and relationships between divergences


Important to our development will be different families of inequalities relating the different
divergence measures. These inequalities will be particularly important because, in some cases,
different distributions admit easy calculations with some divergences, such as KL or χ2 divergence,
but it can be challenging to work with others that may be more “natural” for a particular problem.
Most importantly, replacing a variation distance by bounding it with an alternative divergence is
often convenient for analyzing the properties of product distributions (as will become apparent
in Chapter 9). We record several of these results here, making a passing connection to mutual
information as well.
The first inequality shows that the Hellinger distance and variation distance roughly generate
the same topology on collections of distributions, as they upper and lower bound the other (if we
tolerate polynomial losses).

Proposition 2.2.7. The total variation distance and Hellinger distance satisfy

d²_hel(P, Q) ≤ ∥P − Q∥_TV ≤ d_hel(P, Q) √( 2 − d²_hel(P, Q) ).


Proof We begin with the upper bound. We have by Hölder’s inequality that

½ ∫ |p(x) − q(x)| dµ(x) = ½ ∫ | √p(x) − √q(x) | · | √p(x) + √q(x) | dµ(x)
    ≤ ( ½ ∫ ( √p(x) − √q(x) )² dµ(x) )^{1/2} ( ½ ∫ ( √p(x) + √q(x) )² dµ(x) )^{1/2}
    = d_hel(P, Q) ( 1 + ∫ √(p(x)q(x)) dµ(x) )^{1/2}.

As in Example 2.2.5, we have ∫ √(p(x)q(x)) dµ(x) = 1 − d²_hel(P, Q), so this (along with the representation of Lemma 2.2.4 for variation distance) implies

∥P − Q∥_TV = ½ ∫ |p(x) − q(x)| dµ(x) ≤ d_hel(P, Q) ( 2 − d²_hel(P, Q) )^{1/2}.

For the lower bound on total variation, note that for any a, b ∈ R₊, we have a + b − 2√(ab) ≤ |a − b|
(check the cases a > b and a < b separately); thus

d²_hel(P, Q) = ½ ∫ [ p(x) + q(x) − 2√(p(x)q(x)) ] dµ(x) ≤ ½ ∫ |p(x) − q(x)| dµ(x),
as desired.

Several important inequalities relate the variation distance to the KL-divergence. We state
two in the next proposition, both of which are important enough to justify their own names.

Proposition 2.2.8. The total variation distance satisfies the following relationships.

(a) Pinsker’s inequality: for any distributions P and Q,

∥P − Q∥²_TV ≤ ½ D_kl(P||Q).   (2.2.10)

(b) The Bretagnolle–Huber inequality: for any distributions P and Q,

∥P − Q∥_TV ≤ √( 1 − exp(−D_kl(P||Q)) ) ≤ 1 − ½ exp(−D_kl(P||Q)).

Proof Exercise 2.19 outlines one proof of Pinsker’s inequality using the data processing inequality
(Proposition 2.2.13). We present an alternative via the Cauchy–Schwarz inequality. Using the
definition (2.2.1) of the KL-divergence, we may assume without loss of generality that P and Q are
finitely supported, say with p.m.f.s p₁, . . . , pₘ and q₁, . . . , qₘ. Define the negative entropy function
h(p) = ∑_{i=1}^m p_i log p_i. Then showing that D_kl(P||Q) ≥ 2∥P − Q∥²_TV = ½∥p − q∥₁² is equivalent to
showing that

h(p) ≥ h(q) + ⟨∇h(q), p − q⟩ + ½ ∥p − q∥₁²,   (2.2.11)

because by inspection h(p) − h(q) − ⟨∇h(q), p − q⟩ = ∑_i p_i log( p_i/q_i ). We do this via a Taylor expansion:
we have

∇h(p) = [log p_i + 1]_{i=1}^m   and   ∇²h(p) = diag( [1/p_i]_{i=1}^m ).


By Taylor’s theorem, there is some p̃ = (1 − t)p + tq, where t ∈ [0, 1], such that

h(p) = h(q) + ⟨∇h(q), p − q⟩ + ½ ⟨p − q, ∇²h(p̃)(p − q)⟩.

But looking at the final quadratic, we have for any vector v and any p ≥ 0 satisfying ∑_i p_i = 1,

⟨v, ∇²h(p)v⟩ = ∑_{i=1}^m v_i²/p_i = ∥p∥₁ ∑_{i=1}^m v_i²/p_i ≥ ( ∑_{i=1}^m √p_i · |v_i|/√p_i )² = ∥v∥₁²,

where the inequality follows from Cauchy–Schwarz applied to the vectors [√p_i]_i and [|v_i|/√p_i]_i.
Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let a = ∫ √(p(x)q(x)) dµ(x) be a shorthand for the
affinity, so that d²_hel(P, Q) = 1 − a. Then Proposition 2.2.7 gives ∥P − Q∥_TV ≤ √(1 − a) √(1 + a) =
√(1 − a²). Now apply Jensen’s inequality to the exponential: we have

∫ √(p(x)q(x)) dµ(x) = ∫ √( q(x)/p(x) ) p(x) dµ(x) = ∫ exp( ½ log( q(x)/p(x) ) ) p(x) dµ(x)
    ≥ exp( ½ ∫ p(x) log( q(x)/p(x) ) dµ(x) ) = exp( −½ D_kl(P||Q) ).

In particular, √(1 − a²) ≤ √( 1 − exp(−½ D_kl(P||Q))² ) = √( 1 − exp(−D_kl(P||Q)) ), which is the first claim of part (b). For the
second, note that √(1 − c) ≤ 1 − ½c for c ∈ [0, 1] by concavity of the square root.

We also have the following bounds on the Hellinger distance in terms of the KL-divergence, and
on the KL-divergence in terms of the χ²-divergence.

Proposition 2.2.9. For any distributions P, Q,

2 d²_hel(P, Q) ≤ D_kl(P||Q) ≤ log( 1 + D_χ²(P||Q) ) ≤ D_χ²(P||Q).

Proof For the first inequality, note that log x ≤ x − 1 by concavity, or 1 − x ≤ −log x, so that

2 d²_hel(P, Q) = 2 − 2 ∫ √(p(x)q(x)) dµ(x)
    = 2 ∫ p(x) ( 1 − √( q(x)/p(x) ) ) dµ(x) ≤ 2 ∫ p(x) log √( p(x)/q(x) ) dµ(x) = D_kl(P||Q).

The last two inequalities are simple: by Jensen’s inequality, we have

D_kl(P||Q) ≤ log ∫ (dP)²/dQ = log( 1 + D_χ²(P||Q) ).

The last inequality is immediate as log(1 + t) ≤ t for all t > −1.
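A quick numerical sanity check of Pinsker’s inequality, the Bretagnolle–Huber inequality, and Proposition 2.2.9 (an editorial addition; the random p.m.f.s are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    p, q = rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10))
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    hel2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2)
    chi2 = np.sum(p**2 / q) - 1
    assert tv**2 <= kl / 2                             # Pinsker (2.2.10)
    assert tv <= np.sqrt(1 - np.exp(-kl))              # Bretagnolle-Huber
    assert 2 * hel2 <= kl <= np.log(1 + chi2) <= chi2  # Proposition 2.2.9
print("all inequalities hold")
```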

It is also possible to relate mutual information between distributions to f -divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In


this case, we consider the following situation: let V ∈ {0, 1} be uniform at random, and conditional
on V = v, draw X ∼ P_v for some distribution P_v on a space X . Then we have that

I(X; V ) = ½ D_kl(P₀ || P̄) + ½ D_kl(P₁ || P̄),

where P̄ = ½P₀ + ½P₁. The divergence measure on the right side of the preceding identity is a
special case of the Jensen–Shannon divergence, defined for λ ∈ [0, 1] by

D_js,λ(P||Q) := λ D_kl(P || λP + (1 − λ)Q) + (1 − λ) D_kl(Q || λP + (1 − λ)Q),   (2.2.12)

which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand
D_js(P||Q) := D_js,1/2(P||Q) for the symmetric case). As a consequence, we also have

I(X; V ) = ½ D_f(P₀||P₁) + ½ D_f(P₁||P₀),

where f(t) = −t log( 1/(2t) + ½ ) = t log( 2t/(t + 1) ), so that the mutual information is a particular f -divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.
Proposition 2.2.10. Let (X, V ) be distributed as above. Then

log 2 · d²_hel(P₀, P₁) ≤ I(X; V ) = D_js(P₀||P₁) ≤ min{ log 2 · ∥P₀ − P₁∥_TV , 2 d²_hel(P₀, P₁) }.
Proof The lower bound and upper bound involving the variation distance both follow from
analytic bounds on the binary entropy functional h₂(p) = −p log p − (1 − p) log(1 − p). By expanding
the mutual information and letting p₀ and p₁ be densities of P₀ and P₁ with respect to some base
measure µ, we have

2 I(X; V ) = 2 D_js(P₀||P₁) = ∫ p₀ log( 2p₀/(p₀ + p₁) ) dµ + ∫ p₁ log( 2p₁/(p₀ + p₁) ) dµ
    = 2 log 2 + ∫ (p₀ + p₁) [ (p₀/(p₀ + p₁)) log( p₀/(p₀ + p₁) ) + (p₁/(p₀ + p₁)) log( p₁/(p₀ + p₁) ) ] dµ
    = 2 log 2 − ∫ (p₀ + p₁) h₂( p₀/(p₀ + p₁) ) dµ.

We claim that

2 log 2 · min{p, 1 − p} ≤ h₂(p) ≤ 2 log 2 · √( p(1 − p) )

for all p ∈ [0, 1] (see Exercises 2.17 and 2.18). Then the upper and lower bounds on the information
become nearly immediate.
For the variation-based upper bound on I(X; V ), we use the lower bound h₂(p) ≥ 2 log 2 · min{p, 1 − p} to write

(2/log 2) I(X; V ) ≤ 2 − ∫ (p₀(x) + p₁(x)) min{ p₀(x)/(p₀(x) + p₁(x)), p₁(x)/(p₀(x) + p₁(x)) } dµ(x)
    = 2 − 2 ∫ min{p₀(x), p₁(x)} dµ(x)
    = 2 ∫ (p₁(x) − min{p₀(x), p₁(x)}) dµ(x) = 2 ∫_{p₁>p₀} (p₁(x) − p₀(x)) dµ(x).


But of course the final integral is ∥P₁ − P₀∥_TV, giving I(X; V ) ≤ log 2 · ∥P₀ − P₁∥_TV. Conversely,
for the lower bound on D_js(P₀||P₁), we use the upper bound h₂(p) ≤ 2 log 2 · √(p(1 − p)) to obtain

(1/log 2) I(X; V ) ≥ 1 − ∫ (p₀ + p₁) √( (p₀/(p₀ + p₁)) (1 − p₀/(p₀ + p₁)) ) dµ
    = 1 − ∫ √(p₀ p₁) dµ = ½ ∫ ( √p₀ − √p₁ )² dµ = d²_hel(P₀, P₁),

as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have

D_js(P₀||P₁) = ½ D_kl(P₀ || (P₀ + P₁)/2) + ½ D_kl(P₁ || (P₀ + P₁)/2)
    ≤ ½ D_χ²(P₀ || (P₀ + P₁)/2) + ½ D_χ²(P₁ || (P₀ + P₁)/2)
    = ½ ∫ (p₀ − p₁)²/(p₀ + p₁) dµ = ½ ∫ ( √p₀ − √p₁ )² ( √p₀ + √p₁ )²/(p₀ + p₁) dµ.

Now note that (a + b)² ≤ 2a² + 2b² for any a, b ∈ R, and so ( √p₀ + √p₁ )² ≤ 2(p₀ + p₁), and thus
the final integral has bound ∫ ( √p₀ − √p₁ )² dµ = 2 d²_hel(P₀, P₁).

2.2.5 Convexity and data processing for divergence measures


f -divergences satisfy a number of very useful properties, which we use repeatedly throughout the
lectures. As the KL-divergence is an f -divergence, it of course satisfies these conditions; however,
we state them in fuller generality, treating the KL-divergence results as special cases and corollaries.
We begin by exhibiting the general data processing properties and convexity properties of f -
divergences, each of which specializes to KL divergence. We leave the proof of each of these as
exercises. First, we show that f -divergences are jointly convex in their arguments.

Proposition 2.2.11. Let P₁, P₂, Q₁, Q₂ be distributions on a set X and f : R₊ → R be convex.
Then for any λ ∈ [0, 1],

D_f(λP₁ + (1 − λ)P₂ || λQ₁ + (1 − λ)Q₂) ≤ λ D_f(P₁||Q₁) + (1 − λ) D_f(P₂||Q₂).

The proof of this proposition we leave as Exercise 2.11, which we treat as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.

Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.

We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation


when Z = ϕ(X) for some function ϕ, and more generally when Z = ϕ(X, U ) for a function ϕ and
some additional randomness U . For a distribution P on X , we then define the marginal

K_P(A) := ∫_X K(A | x) dP(x).

We then have the following proposition.

Proposition 2.2.13. Let P and Q be distributions on X and let K be any Markov kernel. Then

Df (KP ||KQ ) ≤ Df (P ||Q) .

See Exercise 2.10 for a proof.


As a corollary, we obtain the following data processing inequality for KL-divergences, where we
abuse notation to write Dkl (X||Y ) = Dkl (P ||Q) for random variables X ∼ P and Y ∼ Q.

Corollary 2.2.14. Let X, Y ∈ X be random variables, let U ∈ U be independent of X and Y , and
let ϕ : X × U → Z for some spaces X , U, Z. Then

D_kl(ϕ(X, U) || ϕ(Y, U)) ≤ D_kl(X||Y).

Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.

2.3 First steps into optimal procedures: testing inequalities


As noted in the introduction, a central benefit of the information theoretic tools we explore is that
they allow us to certify the optimality of procedures—that no other procedure could (substantially)
improve upon the one at hand. The main tools for these certifications are often inequalities gov-
erning the best possible behavior of a variety of statistical tests. Roughly, we put ourselves in the
following scenario: nature chooses one of a possible set of (say) k worlds, indexed by probabil-
ity distributions P1 , P2 , . . . , Pk , and conditional on nature’s choice of the world—the distribution
P ⋆ ∈ {P1 , . . . , Pk } chosen—we observe data X drawn from P ⋆ . Intuitively, it will be difficult to
decide which distribution Pi is the true P ⋆ if all the distributions are similar—the divergence be-
tween the Pi is small, or the information between X and P ⋆ is negligible—and easy if the distances
between the distributions Pi are large. With this outline in mind, we present two inequalities, and
first examples of their application, to make concrete these connections to the notions of information
and divergence defined in this section.

2.3.1 Le Cam’s inequality and binary hypothesis testing


The simplest instantiation of the above setting is the case when there are only two possible dis-
tributions, P1 and P2 , and our goal is to make a decision on whether P1 or P2 is the distribution
generating data we observe. Concretely, suppose that nature chooses one of the distributions P1
or P2 at random, and let V ∈ {1, 2} index this choice. Conditional on V = v, we then observe a
sample X drawn from Pv . Denoting by P the joint distribution of V and X, we have for any test
Ψ : X → {1, 2} that the probability of error is then
P(Ψ(X) ≠ V ) = ½ P₁(Ψ(X) ≠ 1) + ½ P₂(Ψ(X) ≠ 2).
2 2


We can give an exact expression for the minimal possible error in the above hypothesis test.
Indeed, a standard result of Le Cam (see [134, 194, Lemma 1]) is the following variational representa-
tion of the total variation distance (2.2.6), which is the f -divergence associated with f(t) = ½|t − 1|,
as a function of testing error.

Proposition 2.3.1. Let X be an arbitrary set. For any distributions P1 and P2 on X , we have

inf_Ψ { P₁(Ψ(X) ≠ 1) + P₂(Ψ(X) ≠ 2) } = 1 − ∥P₁ − P₂∥_TV,

where the infimum is taken over all tests Ψ : X → {1, 2}.

Proof Any test Ψ : X → {1, 2} has an acceptance region, call it A ⊂ X , where it outputs 1 and
a region Aᶜ where it outputs 2. Then

P₁(Ψ ≠ 1) + P₂(Ψ ≠ 2) = P₁(Aᶜ) + P₂(A) = 1 − P₁(A) + P₂(A).

Taking an infimum over such acceptance regions, we have

inf_Ψ { P₁(Ψ ≠ 1) + P₂(Ψ ≠ 2) } = inf_{A⊂X} { 1 − (P₁(A) − P₂(A)) } = 1 − sup_{A⊂X} (P₁(A) − P₂(A)),

which yields the total variation distance as desired.

In the two-hypothesis case, we also know that the optimal test, by the Neyman–Pearson lemma,
is a likelihood ratio test. That is, assuming that P₁ and P₂ have densities p₁ and p₂, the optimal
test is of the form

Ψ(X) = { 1  if p₁(X)/p₂(X) ≥ t,
       { 2  if p₁(X)/p₂(X) < t,

for some threshold t ≥ 0. In the case that the prior probabilities on P₁ and P₂ are each ½, then
t = 1 is optimal.
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.

Example 2.3.2 (Testing a normal mean): Suppose we observe an i.i.d. sample X₁, . . . , Xₙ ∼ P for P = P₁
or P = P₂, where P_v is the normal distribution N(µ_v, σ²), where µ₁ ≠ µ₂. We would like to
understand the sample size n necessary to guarantee that no test can have small error, that
is, say, that

inf_Ψ { P₁(Ψ(X₁, . . . , Xₙ) ≠ 1) + P₂(Ψ(X₁, . . . , Xₙ) ≠ 2) } ≥ ½.

By Proposition 2.3.1, we have that

inf_Ψ { P₁(Ψ(X₁, . . . , Xₙ) ≠ 1) + P₂(Ψ(X₁, . . . , Xₙ) ≠ 2) } ≥ 1 − ∥P₁ⁿ − P₂ⁿ∥_TV,

where P_vⁿ denotes the n-fold product of P_v, that is, the distribution of an i.i.d. sample X₁, . . . , Xₙ ∼ P_v.
The interaction between total variation distance and product distributions is somewhat
subtle, so it is often advisable to use a divergence measure more attuned to the i.i.d. nature
of the sampling scheme. Two such measures are the KL-divergence and Hellinger distance,
both of which we explore in the coming chapters. With that in mind, we apply Pinsker's
inequality (2.2.10) to see that ∥P₁ⁿ − P₂ⁿ∥²_TV ≤ ½ D_kl(P₁ⁿ||P₂ⁿ) = (n/2) D_kl(P₁||P₂), which implies
that

1 − ∥P₁ⁿ − P₂ⁿ∥_TV ≥ 1 − √( (n/2) D_kl(P₁||P₂) ) = 1 − √( (n/(4σ²)) (µ₁ − µ₂)² ) = 1 − (√n |µ₁ − µ₂|)/(2σ).

In particular, if n ≤ σ²/(µ₁ − µ₂)², then we have our desired lower bound of ½.
Conversely, a calculation yields that n ≥ Cσ²/(µ₁ − µ₂)², for some numerical constant C ≥ 1,
implies small probability of error. We leave this calculation to the reader. 3
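A small simulation (an editorial addition with illustrative parameters; the mean-threshold rule is the likelihood ratio test here) makes the example tangible: the summed error always sits above the Pinsker-based lower bound 1 − √n|µ₁ − µ₂|/(2σ).

```python
import numpy as np

rng = np.random.default_rng(4)
mu1, mu2, sigma, trials = 0.0, 0.5, 1.0, 20000
for n in [1, 4, 16]:
    err_sum = 0.0
    for truth, mu in [(1, mu1), (2, mu2)]:
        x = rng.normal(mu, sigma, size=(trials, n))
        # classify by which side of the midpoint the sample mean falls
        guess = np.where(x.mean(axis=1) < (mu1 + mu2) / 2, 1, 2)
        err_sum += np.mean(guess != truth)
    bound = max(0.0, 1 - np.sqrt(n) * abs(mu1 - mu2) / (2 * sigma))
    print(n, err_sum, ">=", bound)
```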

2.3.2 Fano’s inequality and multiple hypothesis testing


There are of course situations in which we do not wish to simply test two hypotheses, but have
multiple hypotheses present. In such situations, Fano’s inequality, which we present shortly, is
the most common tool for proving fundamental limits, lower bounds on probability of error, and
converses (to results on achievability of some performance level) in information theory. We write
this section in terms of general random variables, ignoring the precise setting of selecting an index
in a family of distributions, though that is implicit in what we do.
Let X be a random variable taking values in a finite set X , and assume that we observe a
(different) random variable Y , and then must estimate or guess the true value of X with an estimate X̂. That is, we
have the Markov chain

X → Y → X̂,

and we wish to provide lower bounds on the probability of error—that is, that X̂ ≠ X. If we let
the function h2 (p) = −p log p − (1 − p) log(1 − p) denote the binary entropy (entropy of a Bernoulli
random variable with parameter p), Fano’s inequality takes the following form [e.g. 57, Chapter 2]:
Proposition 2.3.3 (Fano inequality). For any Markov chain X → Y → X̂, we have

h₂( P(X̂ ≠ X) ) + P(X̂ ≠ X) log(|X | − 1) ≥ H(X | X̂).   (2.3.1)

Proof This proof follows by expanding an entropy functional in two different ways. Let E be
the indicator for the event that X̂ ≠ X, that is, E = 1 if X̂ ≠ X and is 0 otherwise. Then we have

H(X, E | X̂) = H(X | E, X̂) + H(E | X̂)
             = P(E = 1) H(X | E = 1, X̂) + P(E = 0) H(X | E = 0, X̂) + H(E | X̂),

where H(X | E = 0, X̂) = 0 because given that there is no error, X has no variability given X̂. Expanding
the entropy by the chain rule in a different order, we have

H(X, E | X̂) = H(X | X̂) + H(E | X̂, X),

where H(E | X̂, X) = 0 because E is perfectly predicted by X̂ and X. Combining these equalities, we have

H(X | X̂) = H(X, E | X̂) = P(E = 1) H(X | E = 1, X̂) + H(E | X̂).

Noting that H(E | X̂) ≤ H(E) = h₂(P(E = 1)), as conditioning reduces entropy, and that
H(X | E = 1, X̂) ≤ log(|X | − 1), as X can take on at most |X | − 1 values when there is an error,


completes the proof.

We can rewrite Proposition 2.3.3 in a convenient way when X is uniform on X . Indeed, by
definition of the mutual information, we have I(X; X̂) = H(X) − H(X | X̂), so Proposition 2.3.3
implies the following corollary.

Corollary 2.3.4. Assume that X is uniform on X . For any Markov chain X → Y → X̂,

P(X̂ ≠ X) ≥ 1 − ( I(X; Y ) + log 2 )/log(|X |).   (2.3.2)

Proof Let P_error = P(X ≠ X̂) denote the probability of error. Noting that h₂(p) ≤ log 2 for any
p ∈ [0, 1] (recall inequality (2.1.2), that is, that uniform random variables maximize entropy) and
that log(|X | − 1) ≤ log(|X |), we have

log 2 + P_error log(|X |) ≥ h₂(P_error) + P_error log(|X | − 1) ≥(i) H(X | X̂) =(ii) H(X) − I(X; X̂).

Here step (i) uses Proposition 2.3.3 and step (ii) uses the definition of mutual information, that
I(X; X̂) = H(X) − H(X | X̂). The data processing inequality implies that I(X; X̂) ≤ I(X; Y ),
and using H(X) = log(|X |) completes the proof.

In particular, Corollary 2.3.4 shows that when X is chosen uniformly at random and we observe
Y , we have

inf_Ψ P(Ψ(Y ) ≠ X) ≥ 1 − ( I(X; Y ) + log 2 )/log |X |,

where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity
is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly
chosen from X , then we expect that log₂ |X | bits are necessary (and sufficient). Thus, until we
collect enough information that I(X; Y ) ≈ log |X |, so that I(X; Y )/log |X | ≈ 1, we are unlikely to
be able to identify the variable X with any substantial probability. So we must collect enough
bits to actually discover X.

Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most ⌈log₂ m⌉ questions suffice to find the item
X that the chooser has picked—at each round of the game, the guesser asks a question that
eliminates half of the remaining possible items. Indeed, let us assume that m = 2^l for some
l ∈ N; if not, the guesser can always make her task more difficult by increasing the size of X
until it is a power of 2. Thus, after k rounds, there are m 2^{−k} items left, and we have

m (1/2)^k ≤ 1 if and only if k ≥ log₂ m.


For the converse—the chooser’s strategy—let Y₁, Y₂, . . . , Y_k be the sequence of yes/no answers
given to the guesser. Assume that the chooser picks X uniformly at random in X . Then
Fano’s inequality (2.3.2) implies that for the guess X̂ the guesser makes,

P(X̂ ≠ X) ≥ 1 − ( I(X; Y₁, . . . , Y_k) + log 2 )/log m.

By the chain rule for mutual information, we have

I(X; Y₁, . . . , Y_k) = ∑_{i=1}^k I(X; Y_i | Y_{1:i−1}) = ∑_{i=1}^k [ H(Y_i | Y_{1:i−1}) − H(Y_i | Y_{1:i−1}, X) ] ≤ ∑_{i=1}^k H(Y_i).

As the answers Y_i are yes/no, we have H(Y_i) ≤ log 2, so that I(X; Y_{1:k}) ≤ k log 2. Thus we
find

P(X̂ ≠ X) ≥ 1 − ( (k + 1) log 2 )/log m = ( log₂ m − 1 − k )/log₂ m,

so that the guesser must have k ≥ log₂(m/2) to be guaranteed that she will make no
mistakes. 3
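The bound is trivial to tabulate; a tiny computation (an editorial addition, with m = 1024 as an arbitrary universe size) shows how the guarantee decays with the number of questions.

```python
import math

m = 1024  # universe size, so log2(m) = 10
for k in range(0, 11, 2):
    # Fano lower bound on the guesser's error after k yes/no answers
    bound = max(0.0, (math.log2(m) - 1 - k) / math.log2(m))
    print(f"k = {k:2d} questions: P(error) >= {bound:.2f}")
```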

2.4 A first operational result: entropy and source coding


The final section of this chapter explores the basic results in source coding. Source coding—in its
simplest form—tells us precisely how many bits (or some other unit of information storage)
are necessary to perfectly encode a sequence of random variables X₁, X₂, . . . drawn according to a
known distribution P .

2.4.1 The source coding problem


Assume we receive data consisting of a sequence of symbols X1 , X2 , . . ., drawn from a known
distribution P on a finite or countable space X . We wish to choose an encoding, represented by a
d-ary code function C that maps X to finite strings consisting of the symbols {0, 1, . . . , d − 1}. We
denote this by C : X → {0, 1, . . . , d − 1}∗ , where the superscript ∗ denotes the length may change
from input to input, and use ℓC (x) to denote the length of the string C(x).
In general, we will consider a variety of types of codes; we define each in order of complexity of
their decoding.
Definition 2.1. A d-ary code C : X → {0, . . . , d − 1}∗ is non-singular if for each x, x′ ∈ X we have
C(x) ̸= C(x′ ) if x ̸= x′ .
While Definition 2.1 is natural, generally speaking, we wish to transmit or encode a variety of code-
words simultaneously, that is, we wish to encode a sequence X1 , X2 , . . . using the natural extension
of the code C as the string C(X1 )C(X2 )C(X3 ) · · · , where C(x1 )C(x2 ) denotes the concatenation of
the strings C(x1 ) and C(x2 ). In this case, we require that the code be uniquely decodable:
Definition 2.2. A d-ary code C : X → {0, . . . , d − 1}∗ is uniquely decodable if for all sequences
x1 , . . . , xn ∈ X and x′1 , . . . , x′n ∈ X we have
C(x1 )C(x2 ) · · · C(xn ) = C(x′1 )C(x′2 ) · · · C(x′n ) if and only if x1 = x′1 , . . . , xn = x′n .
That is, the extension of the code C to sequences is non-singular.


While more useful (generally) than simply non-singular codes, uniquely decodable codes may require
inspection of an entire string before recovering the first element. With that in mind, we now consider
the easiest to use codes, which can always be decoded instantaneously.

Definition 2.3. A d-ary code C : X → {0, . . . , d − 1}∗ is a prefix code (also called instantaneous) if
no codeword is the prefix of another codeword.

As is hopefully apparent from the definitions, all prefix/instantaneous codes are uniquely decodable,
which are in turn non-singular. The converse is not true, though we will see a sense in which—as
long as we care only about encoding sequences—using prefix instead of uniquely decodable codes
has negligible consequences.
For example, written English, with periods (.) and spaces ( ) included at the ends of words
(among other punctuation) is an instantaneous encoding of English into the symbols of the alphabet
and punctuation, as punctuation symbols enforce that no “codeword” is a prefix of any other. A
few more concrete examples may make things more clear.

Example 2.4.1 (Encoding strategies): Consider the encoding schemes below, which encode
the letters a, b, c, and d.
Symbol C1 (x) C2 (x) C3 (x)
a 0 00 0
b 00 10 10
c 000 11 110
d 0000 110 111

By inspection, it is clear that C1 is non-singular but certainly not uniquely decodable (does
the sequence 0000 correspond to aaaa, bb, aab, aba, baa, ca, ac, or d?), while C3 is a prefix
code. We leave showing that C2 is uniquely decodable as an exercise. 3

2.4.2 The Kraft-McMillan inequalities


We now turn to a few results on the connections between source coding and entropy. Our first
result, the Kraft–McMillan inequality, is an essential result that—as we shall see—says
that there is no difference in the code lengths attainable by prefix codes and uniquely decodable codes.

Theorem 2.4.2. Let X be a finite or countable set, and let ℓ : X → N be a function. If ℓ(x) is the
length of the encoding of the symbol x in a uniquely decodable d-ary code, then

∑_{x∈X} d^{−ℓ(x)} ≤ 1.   (2.4.1)

Conversely, given any function ℓ : X → N satisfying inequality (2.4.1), there is a prefix code whose
codewords have length ℓ(x) for each x ∈ X .

Proof We prove the first statement of the theorem by a counting and asymptotic argument.
We begin by assuming that X is finite; we eliminate this assumption subsequently. As a
consequence, there is some maximum length ℓ_max such that ℓ(x) ≤ ℓ_max for all x ∈ X . For a sequence
x₁, . . . , xₙ ∈ X , we have by the definition of our encoding strategy that ℓ(x₁, . . . , xₙ) = ∑_{i=1}^n ℓ(x_i).
In addition, for each m we let

En (m) := {x1:n ∈ X n such that ℓ(x1:n ) = m}


[Figure 2.1, a 3-ary prefix tree with symbols at its leaves, appears here.]

Figure 2.1. Prefix-tree encoding of a set of symbols. The encoding for x₁ is 0, for x₂ is 10, for x₃
is 11, for x₄ is 12, for x₅ is 20, for x₆ is 21, and nothing is encoded as 1, 2, or 22.

denote the symbols x₁:ₙ encoded with codewords of length m in our code; then as the code is uniquely
decodable we certainly have card(Eₙ(m)) ≤ d^m for all n and m. Moreover, for all x₁:ₙ ∈ Xⁿ we
have ℓ(x₁:ₙ) ≤ nℓ_max. We thus re-index the sum ∑_x d^{−ℓ(x)} and compute

∑_{x₁,...,xₙ ∈ Xⁿ} d^{−ℓ(x₁,...,xₙ)} = ∑_{m=1}^{nℓ_max} card(Eₙ(m)) d^{−m} ≤ ∑_{m=1}^{nℓ_max} d^{m−m} = nℓ_max.

The preceding relation is true for all n ∈ N, so that

( ∑_{x₁:ₙ ∈ Xⁿ} d^{−ℓ(x₁:ₙ)} )^{1/n} ≤ n^{1/n} ℓ_max^{1/n} → 1

as n → ∞. In particular, using that

∑_{x₁:ₙ ∈ Xⁿ} d^{−ℓ(x₁:ₙ)} = ∑_{x₁,...,xₙ} d^{−ℓ(x₁)} · · · d^{−ℓ(xₙ)} = ( ∑_{x∈X} d^{−ℓ(x)} )ⁿ,

we obtain ∑_{x∈X} d^{−ℓ(x)} ≤ 1.


Returning to the case that card(X ) = ∞, by defining the sequence

D_k := ∑_{x∈X, ℓ(x)≤k} d^{−ℓ(x)},

as each restriction of the code to {x ∈ X : ℓ(x) ≤ k} is uniquely decodable, we have D_k ≤ 1 for all k. Then
1 ≥ lim_{k→∞} D_k = ∑_{x∈X} d^{−ℓ(x)}.
The achievability of such a code intuitively follows by a pictorial argument (recall Figure 2.1),
so we first sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each


level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that ℓ(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
P A more formal version implementing this sketch follows. Let ℓ be a length function satisfying
−ℓ(x) ≤ 1. Identify X with N (or a subset thereof) in such a way that 1 ≤ ℓ(1) ≤ ℓ(2) ≤ . . .,
x∈X d
i.e., ℓ(x) ≤ ℓ(y) whenever x < y, and let Xm = {x ∈ Xm | ℓ(x) = m} be the set of inputs with
encoding length m. For each x ∈ N, define the value
X
v(x) = d−ℓ(i) .
i<x

We let the codeword C(x) for x be the first ℓ(x) terms in the d-ary expansion of v(x). Certainly the
length of this encoding satisfies |C(x)| = ℓ(x). To see that it is prefix-free, take two symbols x < y,
and assume for the sake of contradiction that C(x) is a prefix of C(y). Then v(y) ≥ v(x), while
v(y) − v(x) ≤ d−ℓ(x) because the two representations agree on the first ℓ(x) terms in the expansion.
But
X X X X
v(y) − v(x) = d−ℓ(i) − d−ℓ(i) = d−ℓ(i) = d−ℓ(x) + d−ℓ(i) > d−ℓ(x) ,
i<y i<x x≤i<y x<i<y

a contradiction.

With the Kraft-McMillan theorem in place, we we may directly relate the entropy of a random
variable to the length of possible encodings for the variable; in particular, we show that the entropy
is essentially the best possible code length of a uniquely decodable source code. In this theorem,
we use the shorthand X
Hd (X) := − p(x) logd p(x).
x∈X

Theorem 2.4.3. Let X ∈ X be a discrete random variable distributed according to P and let ℓC
be the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}∗ . In addition, let C
be the set of all uniquely decodable d-ary codes for X . Then

Hd (X) ≤ inf {EP [ℓC (X)] : C ∈ C} ≤ Hd (X) + 1.

Proof The lower bound is an argument by convex optimization, while for the upper bound
we give an explicit length function and (implicit) prefix code attaining the bound. For the lower
bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X |} (let m = |X | for
shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions
must satisfy the Kraft-McMillan inequality (2.4.1). Letting ℓi = ℓ(i), the minimal encoding length
is at least (m m
)
X X
infm pi ℓi : d−ℓi ≤ 1 .
ℓ∈R
i=1 i=1
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the
Lagrangian for the preceding minimization problem as
n
!
X h im
L(ℓ, λ) = p⊤ ℓ + λ d−ℓi − 1 with ∇ℓ L(ℓ, λ) = p − λ d−ℓi log d .
i=1
i=1

40
Lexture Notes on Statistics and Information Theory John Duchi

θ
θ Pm − logd
In particular, the optimal ℓ satisfies ℓi = logd pi for some constant θ, and solving i=1 d
pi
=1
gives θ = 1 and ℓ(i) = logd p1i .
l m
1
To attain the result, simply set our encoding to be ℓ(x) = logd P (X=x) , which satisfies the
Kraft-McMillan inequality and thus yields a valid prefix code with
 
X 1 X
EP [ℓ(X)] = p(x) logd ≤− p(x) logd p(x) + 1 = Hd (X) + 1
p(x)
x∈X x∈X

as desired.

Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.

2.4.3 Entropy rates and longer codes


Theorem 2.4.3 is a bit unsatisfying in that the additive constant 1 may be quite large relative to
the entropy. By allowing encoding longer sequences, we can (asymptotically) eliminate this error
factor. To that end, we here show that it is possible, at least for appropriate distributions on
random variables Xi , to achieve a per-symbol encoding length that approaches a limiting version of
the Shannon entropy of a random variable. We give two definitions capturing the limiting entropy
properties of sequences of random variables.

Definition 2.4. The entropy rate of a sequence X1 , X2 , . . . of random variables is


1
H({Xi }) := lim H(X1 , . . . , Xn ) (2.4.2)
n→∞ n

whenever the limit exists.

In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in
which it does, and we focus generally on a specific but common instance in which the limit does
exist. First, we recall the definition of a stationary sequence of random variables.

Definition 2.5. We say a sequence X1 , X2 , . . . of random variable is stationary if for all n and all
k ∈ N and all measurable sets A1 , . . . , Ak ⊂ X we have

P(X1 ∈ A1 , . . . , Xk ∈ Ak ) = P(Xn+1 ∈ A1 , . . . , Xn+k ∈ Ak ).

With this definition, we have the following result.

Proposition 2.4.4. Let the sequence of random variables {Xi }, taking values in the discrete space
X , be stationary. Then
H({Xi }) = lim H(Xn | X1 , . . . , Xn−1 )
n→∞

and the limits (2.4.2) and above exist.

41
Lexture Notes on Statistics and Information Theory John Duchi

1 Pn
Proof We begin by making the following standard observation of Cesàro means: if cn = n i=1 ai
and ai → a, then cn → a.3 Now, we note that for a stationary sequence, we have that
H(Xn | X1:n−1 ) = H(Xn+1 | X2:n ),
and using that conditioning decreases entropy, we have
H(Xn+1 | X1:n ) ≤ H(Xn | X1:n−1 ).
Thus the sequence an := H(Xn | X1:n−1 ) is non-increasing and
P bounded below by 0, so that it has
some limit limn→∞ H(Xn | X1:n−1 ). As H(X1 , . . . , Xn ) = ni=1 H(Xi | X1:i−1 ) by the chain rule
for entropy, we achieve the result of the proposition.

Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable Xi . To do so, we require the use of a block code, which (while it may be prefix code) treats
sets of random variables (X1 , . . . , Xm ) ∈ X m as a single symbol to be jointly encoded.
Proposition 2.4.5. Let the sequence of random variables X1 , X2 , . . . be stationary. Then for any
ϵ > 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X m → {0, . . . , d − 1}∗ such that
1
lim EP [ℓC (X1:n )] ≤ H({Xi }) + ϵ = lim H(Xn | X1 , . . . , Xn−1 ) + ϵ.
n n n

Proof m ∗
Let C : X → {0, 1, . . . , d − 1} be any prefix code with
 
1
ℓC (x1:m ) ≤ log .
P (X1:m = x1:m )
Then whenever n/m is an integer, we have
n/m n/m
X   X 
EP [ℓC (X1:n )] = EP ℓC (Xmi+1 , . . . , Xm(i+1) ) ≤ H(Xmi+1 , . . . , Xm(i+1) ) + 1
i=1 i=1
n n
= + H(X1 , . . . , Xm ).
m m
1 1
Dividing by n gives the result by taking m suitably large that m +m H(X1 , . . . , Xm ) ≤ ϵ+H({Xi }).
Note that if the m does not divide n, we may also encode the length of the sequence of encoded
words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols,
while if it begins with a 1, then the next ⌈logd m⌉ bits encode the length of the block. This would
yields an increase in the expected length of the code to
2n + ⌈log2 m⌉ n
EP [ℓC (X1:n )] ≤ + H(X1 , . . . , Xm ).
m m
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.

3
Indeed, let ϵ > 0 and take N such that n ≥ N implies that |ai − a| < ϵ. Then for n ≥ N , we have
n n
1X N (cN − a) 1 X N (cN − a)
cn − a = (ai − a) = + (ai − a) ∈ ± ϵ.
n i=1 n n i=N +1 n

Taking n → ∞ yields that the term N (cN − a)/n → 0, which gives that cn − a ∈ [−ϵ, ϵ] eventually for any ϵ > 0,
which is our desired result.

42
Lexture Notes on Statistics and Information Theory John Duchi

2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an essen-
tially complete treatment in Chapter 2 of their book [57]. Gray [104] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chap-
ter 7 in the newer addition of the same book). Csiszár and Körner [61] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [6] and Csiszár [59], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [137]
provide a survey of f -divergences and their relationships with different statistical concepts (taking a
Bayesian point of view), and various authors have extended the pairwise divergence measures to di-
vergence measures between multiple distributions [107], making connections to experimental design
and classification [98, 76], which we investigate later in book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [134, 182]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [76, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [138].
JCD Comment: Converse to Kraft is Chaitin?

2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X . The total variation distance between P1 and P2 is defined as

∥P1 − P2 ∥TV := sup |P1 (A) − P2 (A)| .


A⊂X

Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
R
(a) 2 ∥P1 − P2 ∥TV = |p1 (x) − p2 (x)|dx.

(b) For functions f : X → R, Rdefine the supremum norm ∥f ∥∞ = supx∈X |f (x)|. Show that
2 ∥P1 − P2 ∥TV = sup∥f ∥∞ ≤1 X f (x)(p1 (x) − p2 (x))dx.
R
(c) ∥P1 − P2 ∥TV = max{p1 (x), p2 (x)}dx − 1.
R
(d) ∥P1 − P2 ∥TV = 1 − min{p1 (x), p2 (x)}dx.

(e) For functions f, g : X → R,


Z Z 
inf f (x)p1 (x)dx + g(x)p2 (x)dx : f + g ≥ 1, f ≥ 0, g ≥ 0 = 1 − ∥P1 − P2 ∥TV .

Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be N(θ1 , Σ) and
P2 be N(θ2 , Σ), where Σ ≻ 0 is a positive definite matrix.

43
Lexture Notes on Statistics and Information Theory John Duchi

(a) Give Dkl (P1 ||P2 ).

(b) Show that d2hel (P1 , P2 ) = 1 − exp(− 18 (µ0 − µ1 )⊤ Σ−1 (µ0 − µ1 )).

Exercise 2.3 (The optimal test between distributions): Prove Le-Cam’s inequality: for any
function ψ with dom ψ ⊃ X and any distributions P1 , P2 ,

P1 (ψ(X) ̸= 1) + P2 (ψ(X) ̸= 2) ≥ 1 − ∥P1 − P2 ∥TV .

Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample
X we must decide whether P1 or P2 is more likely, has value at least 1 − ∥P1 − P2 ∥TV . Given P1
and P2 is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density p(x) =
λ
2 exp(−λ|x−µ|). Consider the hypothesis test of P1 versus P2 , where X has distribution Laplace(λ, µ1 )
under P1 and distribution Laplace(λ, µ2 ) under P2 , where µ1 < µ2 . Show that the minimal value
over all tests ψ of P1 versus P2 is
 
 λ
inf P1 (ψ(X) ̸= 1) + P2 (ψ(X) ̸= 2) = exp − |µ1 − µ2 | .
ψ 2

Exercise 2.5 (Log-sum inequality): Let a1 , . . . , an and b1 , . . . , bn be non-negative reals. Show


that
n X n  Pn
X ai ai
ai log ≥ ai log Pi=1
n .
bi i=1 bi
i=1 i=1

(Hint: use the convexity of the function x 7→ − log(x).)


Exercise 2.6: Given quantizers g1 and g2 , we say that g1 is a finer quantizer than g2 under the
following condition: assume that g1 induces the partition A1 , . . . , An and g2 induces the partition
B1 , . . . , Bm ; then for any of the sets Bi , there are exists some k and sets Ai1 , . . . , Aik such that
Bi = ∪kj=1 Aij . We let g1 ≺ g2 denote that g1 is a finer quantizer than g2 . Prove

(a) Finer partitions increase the KL divergence: if g1 ≺ g2 ,

Dkl (P ||Q | g2 ) ≤ Dkl (P ||Q | g1 ) .

(b) If X is discrete (so P and Q have p.m.f.s p and q) then


X p(x)
Dkl (P ||Q) = p(x) log .
x
q(x)

Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:

(a) If f (t) = |t − 1|, then Df (P ||Q) = 2 ∥P − Q∥TV .

(b) If f (t) = t log t, then Df (P ||Q) = Dkl (P ||Q).

(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).

44
Lexture Notes on Statistics and Information Theory John Duchi

(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)

Exercise 2.8 (Generalized “log-sum” inequalities): Let f : R+ → R be an arbitrary convex


function.
(a) Let ai , bi , i = 1, . . . , n be non-negative reals. Prove that
X n   Pn  X n  
i=1 bi bi
ai f P n ≤ ai f .
a
i=1 i ai
i=1 i=1

(b) Generalizing the preceding result, let a : X → R+ and b : X → R+ , and let µ be a finite
measure on X with respect to which a is integrable. Show that
Z R  Z  
b(x)dµ(x) b(x)
a(x)dµ(x)f R ≤ a(x)f dµ(x).
a(x)dµ(x) a(x)
If you are unfamiliarR with measure theory, prove the following essentially equivalent result: let
u : X → R+ satisfy u(x)dx < ∞. Show that
Z R  Z  
b(x)u(x)dx b(x)
a(x)u(x)dxf R ≤ a(x)f u(x)dx
a(x)u(x)dx a(x)
R
whenever a(x)u(x)dx
R < ∞. (It is possible to demonstrate this remains true under appropriate
limits even when a(x)u(x)dx = +∞, but it is a mess.)
(Hint: use the fact that the perspective of a function f , defined by h(x, t) = tf (x/t) for t > 0, is
jointly convex in x and t (see Proposition B.3.12).
Exercise 2.9 (Data processing and f -divergences I): As with the KL-divergence, given a quantizer
g of the set X , where g induces a partition A1 , . . . , Am of X , we define the f -divergence between
P and Q conditioned on g as
m m
P (g −1 ({i}))
  X  
X P (Ai ) −1
Df (P ||Q | g) := Q(Ai )f = Q(g ({i}))f .
Q(Ai ) Q(g −1 ({i}))
i=1 i=1

Given quantizers g1 and g2 , we say that g1 is a finer quantizer than g2 under the following condition:
assume that g1 induces the partition A1 , . . . , An and g2 induces the partition B1 , . . . , Bm ; then for
any of the sets Bi , there are exists some k and sets Ai1 , . . . , Aik such that Bi = ∪kj=1 Aij . We let
g1 ≺ g2 denote that g1 is a finer quantizer than g2 .

(a) Let g1 and g2 be quantizers of the set X , and let g1 ≺ g2 , meaning that g1 is a finer quantization
than g2 . Prove that
Df (P ||Q | g2 ) ≤ Df (P ||Q | g1 ) .
Equivalently, show that whenever A and B are collections of sets partitioning X , but A is a
finer partition of X than B, that
  X  
X P (B) P (A)
Q(B)f ≤ Q(A)f .
Q(B) Q(A)
B∈B A∈A

(Hint: Use the result of Question 2.8(a)).

45
Lexture Notes on Statistics and Information Theory John Duchi

(b) Suppose that X is countable (or finite) so that P and Q have p.m.f.s p and q. Show that
 
X p(x)
Df (P ||Q) = q(x)f ,
x
q(x)

where on the left we are using the partition definition (2.2.3); you should show that the partition
into discrete parts of X achieves the supremum. You may assume that X is finite. (Though
feel free to prove the result in the case that X is infinite.)

Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
R R
(a) Define the marginals KP (A) = K(A, x)p(x)dx and KQ (A) = K(A, x)q(x)dx. Show that
Df (KP ||KQ ) ≤ Df (P ||Q) .
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as
If (X; Y ) := Df (PXY ||PX × PY ) .
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).

Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:

Dkl (P ||Q) = sup EP [f (X)] − log EQ [exp(f (X))] .
f :EQ [ef (X) ]<∞

You may assume that P and Q have densities.


Exercise 2.13: Let P and Q have densities p and q with respect to the base measure µ over the
set X . (Recall that this is no loss of generality, as we may take µ = P + Q.) Define the support
supp P := {x ∈ X : p(x) > 0}. Show that
1
Dkl (P ||Q) ≥ log .
Q(supp P )

Exercise 2.14: Let P1 be N(θ1 , Σ1 ) and P2 be N(θ2 , Σ2 ), where Σi ≻ 0 are positive definite
matrices. Give Dkl (P1 ||P2 ).
Exercise 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then

46
Lexture Notes on Statistics and Information Theory John Duchi

R  R
(a) I(X; V ) = Dkl Pv ||P dµ(v), where P = Pv dµ(v) is the (weighted) average of the Pv . You
may assume that V is discrete if you like.
R 
(b) For any distribution
R Q on X , I(X; V ) = Dkl (Pv ||Q) dµ(v)R − Dkl P ||Q . Conclude that
I(X; V ) ≤ Dkl (Pv ||Q) dµ(v), or, equivalently, P minimizes Dkl (Pv ||Q) dµ(v) over all prob-
abilities Q.

Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions
on X1n = (X1 , . . . , Xn ) ∈ X n , and let Pi (· | xi−1
1 ) be the conditional distribution of Xi given
i−1 i−1
X1 = x1 (and similarly for Qi ). Show that
n
X h i
∥P − Q∥TV ≤ EP Pi (· | X1i−1 ) − Qi (· | X1i−1 ) TV
,
i=1

where the expectation is taken over X1i−1 distributed according to P .


Exercise 2.17: Let h(p) = −p log p − (1 − p) log(1 − p). Show that h(p) ≥ 2 log 2 · min{p, 1 − p}.
Exercise 2.18p(Lin [138], Theorem 8): Let h(p) = −p log p − (1 − p) log(1 − p). Show that
h(p) ≤ 2 log 2 · p(1 − p).
Exercise 2.19 (Proving Pinsker’s inequality via data processing): We work through a proof of
Proposition 2.2.8.(a) using the data processing inequality for f -divergences (Proposition 2.2.13).

(a) Define Dkl (p||q) = p log pq + (1 − p) log 1−p


1−q . Argue that to prove Pinsker’s inequality (2.2.10),
2 1
it is enough to show that (p − q) ≤ 2 Dkl (p||q).

(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that

h(p) ≥ h(q) + h′ (q)(p − q) + 2(p − q)2

for any p, q ∈ [0, 1].

(c) Conclude Pinsker’s inequality (2.2.10).

JCD Comment: Below are a few potential questions

Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik p Endres
and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ Pv , then I(X; V )
is a metric on distributions. (Said differently, Djs (P ||Q)1/2 is a metric on distributions, and it
generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.

47
Chapter 3

Exponential families and statistical


modeling

Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.

3.1 Exponential family models


We begin by defining exponential family distributions, giving several examples to illustrate a few
of their properties. There are three key objects when defining a d-dimensional exponential family
distribution on an underlying space X : the sufficient statistic ϕ : X → Rd representing what we
model, a canonical parameter vector θ ∈ Rd , and a carrier h : X → R+ .
In the discrete case, where X is a discrete set, the exponential family associated with the
sufficient statistic ϕ and carrier h has probability mass function
pθ (x) = h(x) exp (⟨θ, ϕ(x)⟩ − A(θ)) ,
where A is the log-partition-function, sometimes called the cumulant generating function, with
X
A(θ) := log h(x) exp(⟨θ, ϕ(x)⟩).
x∈X

In the continuous case, pθ is instead a density on X ⊂ Rk , and pθ takes the identical form above
but Z
A(θ) = log h(x) exp(⟨θ, ϕ(x)⟩)dx.
X

48
Lexture Notes on Statistics and Information Theory John Duchi

We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.4.)
With our notation, we have the following definition.

Definition 3.1. The exponential family associated with the function ϕ and base measure µ is
defined as the set of distributions with densities pθ with respect to µ, where

pθ (x) = exp (⟨θ, ϕ(x)⟩ − A(θ)) , (3.1.1)

and the function A is the log-partition-function (or cumulant function)


Z
A(θ) := log exp (⟨θ, ϕ(x)⟩) dµ(x) (3.1.2)
X

whenever A is finite (and is +∞ otherwise). The family is regular if the domain

Θ := {θ | A(θ) < ∞}

is open.

In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation

pθ (x) = h(x) exp(⟨θ, ϕ(x)⟩ − A(θ)).

In some scenarios, it may be convient to re-parameterize the problem in terms of some function
η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are
most convenient.
We now give a few examples of exponential family models.

Example 3.1.1 (Bernoulli distribution): In this case, we have X ∈ {0, 1} and P (X = 1) = p


for some p ∈ [0, 1] in the classical version of a Bernoulli. Thus we take µ to be the counting
p
measure on {0, 1}, and by setting θ = log 1−p to obtain a canonical representation, we have

P (X = x) = p(x) = px (1 − p)1−x = exp(x log p − x log(1 − p))


 
p  
= exp x log + log(1 − p) = exp xθ − log(1 + eθ ) .
1−p

The Bernoulli family thus has log-partition function A(θ) = log(1 + eθ ). 3

Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually
parameterized by some λ > 0, and for x ∈ N has distribution Pλ (X = x) = (1/x!)λx e− λ. Thus
by taking µ to be counting (discrete) measure on {0, 1, . . .} and setting θ = log λ, we find the
density (probability mass function in this case)
1 x −λ 1 1
p(x) = λ e = exp(x log λ − λ) = exp(xθ − eθ ) .
x! x! x!
Notably, taking h(x) = (x!)−1 and log-partition A(θ) = eθ , we have probability mass function
pθ (x) = h(x) exp(θx − A(θ)). 3

49
Lexture Notes on Statistics and Information Theory John Duchi

Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal
distribution, we take µ to be Lebesgue measure on Rd . If we fix the covariance and vary only
the mean µ in the family N(µ, Σ), then X ∼ N(µ, Σ) has density
 
1 ⊤ −1 1
pµ (x) = exp − (x − µ) Σ (x − µ) − log det(2πΣ) .
2 2

Setting h(x) = − 12 x⊤ Σ−1 x and reparameterizing θ = Σ−1 µ, we obtain


   
1 ⊤ −1 1 ⊤ 1 ⊤
pθ (x) = exp − x Σ x − log det(2πΣ) exp x θ − θ Σθ .
2 2 2
| {z }
=:h(x)

In particular, we have carrier h(x) = exp(− 12 x⊤ Σ−1 x)/((2π)d/2 det(Σ)), sufficient statistic
ϕ(x) = x, and log partition A(θ) = 21 θ⊤ Σ−1 θ. 3

Example 3.1.4 (Normal distribution): Let X ∼ N(µ, Σ). We may re-parameterize this as
as Θ = Σ−1 and θ = Σ−1 µ, and we have density
 
1
pθ,Θ (x) ∝ exp ⟨θ, x⟩ − ⟨xx⊤ , Θ⟩ ,
2

where ⟨·, ·⟩ denotes the Euclidean inner product. See Exercise 3.1. 3

In some cases, it is analytically convenient to include a few more conditions on the exponential
family.

Definition 3.2. Let {Pθ }θ∈Θ be an exponential family as in Definition 3.1. The sufficient statistic
ϕ is minimal if Θ = dom A ⊂ Rd is full-dimensional and there exists no vector u such that

⟨u, ϕ(x)⟩ is constant µ-almost surely.

Definition 3.2 is essentially equivalent to stating that ϕ(x) = (ϕ1 (x), . . . , ϕd (x)) has linearly inde-
pendent components when viewed as vectors [ϕi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [41, Chapter 1]).

3.2 Why exponential families?


There are many reasons for us to study exponential families. The first major reason is their
analytical tractability: as the normal distribution does, they often admit relatively straightforward
computation, therefore forming a natural basis for modeling decisions. Their analytic tractability
has made them the objects of substantial study for nearly the past hundred years; Brown [41]
provides a deep and elegant treatment. Moreover, as we see later, they arise as the solutions to
several natural optimization problems on the space of probability distributions, and they also enjoy
certain robustness properties related to optimal Bayes’ procedures (there is, of course, more to
come on this topic).

50
Lexture Notes on Statistics and Information Theory John Duchi

Here, we enumerate a few of their keyR analytical properties, focusing on the cumulant generating
(or log partition) function A(θ) = log e⟨θ,ϕ(x)⟩ dµ(x). We begin with a heuristic calculation, where
we assume that we exchange differentiation and integration. Assuming that this is the case, we
then obtain the important expectation and covariance relationships that
Z
1
∇A(θ) = R ⟨θ,ϕ(x)⟩ ∇θ e⟨θ,ϕ(x)⟩ dµ(x)
e dµ(x)
Z Z
−A(θ) ⟨θ,ϕ(x)⟩
=e ∇θ e dµ(x) = ϕ(x)e⟨θ,ϕ(x)⟩−A(θ) dµ(x) = Eθ [ϕ(X)]

because e⟨θ,ϕ(x)⟩−A(θ) = pθ (x). A completely similar (and still heuristic, at least at this point)
calculation gives

∇2 A(θ) = Eθ [ϕ(X)ϕ(X)⊤ ] − Eθ [ϕ(X)]Eθ [ϕ(X)]⊤ = Covθ (ϕ(X)).

That these identities hold is no accident and is central to the appeal of exponential family models.
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that ∇2 A(θ) = Covθ (ϕ(X)) ⪰ 0 makes convexity of A immediate, one can also provide a
direct argument without appealing to differentiation.

Proposition 3.2.1. The cumulant-generating function θ 7→ A(θ) is convex, and it is strictly convex
if and only if Covθ (ϕ(X)) is positive definite for all θ ∈ dom A.

Proof Let θλ = λθ1 + (1 − λ)θ2 , where θ1 , θ2 ∈ Θ. Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder’s
inequality implies
Z Z
log exp(⟨θλ , ϕ(x)⟩)dµ(x) = log exp(⟨θ1 , ϕ(x)⟩)λ exp(⟨θ2 , ϕ(x)⟩)1−λ dµ(x)
Z λ Z 1−λ
λ 1−λ
≤ log exp(⟨θ1 , ϕ(x)⟩) dµ(x)
λ exp(⟨θ2 , ϕ(x)⟩) 1−λ dµ(x)
Z Z
= λ log exp(⟨θ1 , ϕ(x)⟩)dµ(x) + (1 − λ) log exp(⟨θ2 , ϕ(x)⟩)dµ(x),

as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we
formally show that ∇2 A(θ) = Covθ (ϕ(X)).

We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of
the sufficient statistics ϕ(x). To describe the properties, we provide a bit of notation related to
tensor products: for a vector x ∈ Rd , we let

x⊗k := x
| ⊗x⊗ {z· · · ⊗ x}
k times

denote the kth order tensor, or multilinear operator, that for v1 , . . . , vk ∈ Rd satisfies
k
Y
⊗k
x (v1 , . . . , vk ) := ⟨x, v1 ⟩ · · · ⟨x, vk ⟩ = ⟨x, vi ⟩.
i=1

51
Lexture Notes on Statistics and Information Theory John Duchi

When k = 2, this is the familiar outer product x⊗2 = xx⊤ . (More generally, one may think of x⊗k
as a d × d × · · · × d box, where the (i1 , . . . , ik ) entry is [x⊗k ]i1 ,...,ik = xi1 · · · xik .) With this notation,
our first key result regards the differentiability of A, where we can compute (all) derivatives of eA(θ)
by interchanging integration and differentiation.

Proposition 3.2.2. The cumulant-generating function θ 7→ A(θ) is infinitely differentiable on the


interior of its domain Θ := {θ ∈ Rd : A(θ) < ∞}. The moment-generating function
Z
M (θ) := exp(⟨θ, ϕ(x)⟩)dµ(x)

is analytic on the set ΘC := {z ∈ Cd | Re z ∈ Θ}. Additionally, the derivatives of M are computed


by passing through the integral, that is,
Z Z
⟨θ,ϕ(x)⟩
k k
∇θ M (θ) = ∇θ e dµ(x) = ∇kθ e⟨θ,ϕ(x)⟩ dµ(x)
Z
= ϕ(x)⊗k exp(⟨θ, ϕ(x)⟩)dµ(x).

The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation
and covariance relationships that
Z Z
1 ⟨θ,ϕ(x)⟩
∇A(θ) = R ⟨θ,ϕ(x)⟩ ∇e dµ(x) = ϕ(x)pθ (x)dµ(x) = Eθ [ϕ(X)]
e dµ(x)

and

( ϕ(x)e⟨θ,ϕ(x)⟩ dµ(x))⊗2
Z R
2 1 ⊗2 ⟨θ,ϕ(x)⟩
∇ A(θ) = R ϕ(x) e dµ(x) − R
e⟨θ,ϕ(x)⟩ dµ(x) ( e⟨θ,ϕ(x)⟩ dµ(x))2
= Eθ [ϕ(X)ϕ(X)⊤ ] − Eθ [ϕ(X)]Eθ [ϕ(X)]⊤
= Covθ (ϕ(X)).

Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties.
Recall that A is strictly convex if

A(λθ0 + (1 − λ)θ1 ) < λA(θ0 ) + (1 − λ)A(θ1 )

whenever λ ∈ (0, 1) and θ0 , θ1 ∈ dom A. We have the following proposition.

Proposition 3.2.3. Let {Pθ } be a regular exponential family. The log partition function A is
strictly convex if and only if {Pθ } is minimal.

Proof If the family is minimal, then Varθ (u⊤ ϕ(X)) > 0 for any vector u, while Varθ (u⊤ ϕ(X)) =
u⊤ ∇2 A(θ)u. This implies the strict positive definiteness ∇2 A(θ) ≻ 0, which is equivalent to strict
convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if ∇2 A(θ) ≻ 0 for all θ ∈ Θ, then
Varθ (u⊤ ϕ(X)) > 0 for all u ̸= 0 and so u⊤ ϕ(x) is non-constant in x.

52
Lexture Notes on Statistics and Information Theory John Duchi

3.2.1 Fitting an exponential family model


The convexity and differentiability properties make exponential family models especially attractive
from a computational perspective. A major focus in statistics is the convergence of estimates of
different properties of a population distribution P and whether these estimates are computable.
We will develop tools to address the first of these questions, and attendant optimality guarantees,
throughout this book. To set the stage for what follows, let us consider what this entails in the
context of exponential family models.
Suppose we have a population P (where, for simplicity, we assume P has a density p), and for
a given exponential family P with densities {pθ }, we wish to find the model closest to P . Then it
is natural (if we take on faith that the information-theoretic measures we have developed are the
“right” ones) find the distribution Pθ ∈ P closest to P in KL-divergence, that is, to solve
Z
p(x)
minimize Dkl (P ||Pθ ) = p(x) log dx. (3.2.1)
θ pθ (x)

This is evidently equivalent to minimizing


Z Z
− p(x) log pθ (x)dx = p(x) [−⟨θ, ϕ(x)⟩ + A(θ)] dx = −⟨θ, EP [ϕ(X)]⟩ + A(θ).

This is always a convex optimization problem (see Appendices B and C for much more on this), as A
is convex and the first term is linear, and so has no non-global optima. Here and throughout, as we
mention in the introductory remarks to this chapter, we treat convex optimization as a technology:
as long as the dimension of a problem is not too large and its objective can be evaluated, it is
(essentially) computationally trivial.
Of course, we never have access to the population P fully; instead, we receive a sample
X1 , . . . , Xn from P . In this case, a natural approach is to replace the expected (negative) log
likelihood above with its empirical version and solve
n
X n
X
minimize − log pθ (Xi ) = [−⟨θ, ϕ(Xi )⟩ + A(θ)], (3.2.2)
θ
i=1 i=1

which is still a convex optimization problem (as the objective is convex in θ). The maximum
likelihood estimate is any vector θbn minimizing the negative log likelihood (3.2.2), which by setting
gradients to 0 is evidently any vector satisfying
n
1X
∇A(θbn ) = Eθbn [ϕ(X)] = ϕ(Xi ). (3.2.3)
n
i=1

In particular, we need only find a parameter θbn matching moments of the empirical distribution
of the observed Xi ∼ P . This θbn is unique whenever Covθ (ϕ(X)) ≻ 0 for all θ, that is, when
the covariance of ϕ is full rank in the exponential family model, because then the objective in the
minimization problem (3.2.2) is strictly convex.
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the
estimator θbn ; the next paragraph assumes a comfort with some of classical asymptotic statistics
(and the central limit theorem) and is not essential for what comes later. Then we can see how
minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data

53
Lexture Notes on Statistics and Information Theory John Duchi

Xi are i.i.d. from an exponential family model Pθ⋆ . Then we expect that the maximum likelihood
estimate θbn should converge to θ⋆ , and so
n
1X
ϕ(Xi ) = ∇A(θbn ) = ∇A(θ⋆ ) + (∇2 A(θ⋆ ) + o(1))(θbn − θ⋆ ).
n
i=1

But of course, ∇A(θ⋆ ) = Eθ⋆ [ϕ(X)], and so the central limit theorem gives that
n
1X ·
(ϕ(Xi ) − ∇A(θ⋆ )) ∼ N 0, n−1 Covθ⋆ (ϕ(X)) = N 0, n−1 ∇2 A(θ⋆ ) ,
 
n
i=1

·
where ∼ means “is approximately distributed as.” Multiplying by (∇2 A(θ⋆ )+o(1))−1 ≈ ∇2 A(θ⋆ )−1 ,
we thus see (still working in our heuristic)
n
1 X
θbn − θ⋆ = (∇2 A(θ⋆ ) + o(1))−1 (ϕ(Xi ) − ∇A(θ⋆ ))
n
i=1
· −1 2 ⋆ −1

∼ N 0, n · ∇ A(θ ) , (3.2.4)

where we use that BZ ∼ N(0, BΣB ⊤ ) if Z ∼ N(0, Σ). (It is possible to make each of these steps
fully rigorous.) Thus the cumulant generating function A governs the error we expect in θbn − θ⋆ .
Much of the rest of this book explores properties of these types of minimization problems: at
what rates do we expect θbn to converge to a global minimizer of problem (3.2.1)? Can we show
that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential
families form a particular working example to motivate this development.

3.3 Divergence measures and information for exponential families


Their nice analytic properties mean that exponential family models also play nicely with the in-
formation theoretic tools we develop. Indeed, consider the KL-divergence between two exponential
family distributions Pθ and Pθ+∆ , where ∆ ∈ Rd . Then we have

Dkl (Pθ ||Pθ+∆ ) = Eθ [⟨θ, ϕ(X)⟩ − A(θ) − ⟨θ + ∆, ϕ(X)⟩ + A(θ + ∆)]


= A(θ + ∆) − A(θ) − Eθ [⟨∆, ϕ(X)⟩]
= A(θ + ∆) − A(θ) − ∇A(θ)⊤ ∆.

Similarly, we have

Dkl (Pθ+∆ ||Pθ ) = Eθ+∆ [⟨θ + ∆, ϕ(X)⟩ − A(θ + ∆) − ⟨θ, ϕ(X)⟩ + A(θ)]
= A(θ) − A(θ + ∆) + Eθ+∆ [⟨∆, ϕ(X)⟩]
= A(θ) − A(θ + ∆) − ∇A(θ + ∆)⊤ (−∆).

These identities give an immediate connection with convexity. Indeed, for a differentiable convex
function h, the Bregman divergence associated with h is

Dh (u, v) = h(u) − h(v) − ⟨∇h(v), u − v⟩, (3.3.1)

54
Lexture Notes on Statistics and Information Theory John Duchi

which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. One might more accurately call the quantity (3.3.1) the “first-
order divergence,” which is more evocative, but the statistical, machine learning, and optimization
literatures—in which such divergences frequently appear—have adopted this terminology, so we
stick with it.
JCD Comment: Put in a picture of a Bregman divergence

We catalog these results as the following proposition.

Proposition 3.3.1. Let {Pθ } be an exponential family model with cumulant generating function
A(θ). Then

Dkl (Pθ ||Pθ+∆ ) = DA (θ + ∆, θ) and Dkl (Pθ+∆ ||Pθ ) = DA (θ, θ + ∆).

Additionally, there exists a t ∈ [0, 1] such that


1
Dkl (Pθ ||Pθ+∆ ) = ∆⊤ ∇2 A(θ + t∆)∆,
2
and similarly, there exists a t ∈ [0, 1] such that
1
Dkl (Pθ+∆ ||Pθ ) = ∆⊤ ∇2 A(θ + t∆)∆.
2
Proof We have already shown the first two statements; the second two are applications of Tay-
lor’s theorem.

When the perturbation ∆ is small, that A is infinitely differentiable then gives that
1
Dkl (Pθ ||Pθ+∆ ) = ∆⊤ ∇2 A(θ)∆ + O(∥∆∥3 ),
2
so that the Hessian ∇2 A(θ) tells quite precisely how the KL divergence changes as θ varies (locally).
As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between
two distributions is small, it is hard to test between them, and in the sequel, we will show converses
to this. The Hessian ∇2 A(θ⋆ ) also governs the error in the estimate θbn − θ⋆ in our heuristic (3.2.4).
When the Hessian ∇2 A(θ) is quite positive semidefinite, the KL divergence Dkl (Pθ ||Pθ+∆ ) is large,
and the asymptotic covariance (3.2.4) is small. For this—and other reasons we address later—for
exponential family models, we call

∇2 A(θ) = Covθ (ϕ(X)) = Eθ [∇ log pθ (X)∇ log pθ (X)⊤ ] (3.3.2)

the Fisher information of the parameter θ in the model {Pθ }.

3.4 Generalized linear models and regression


We can specialize the general modeling strategies that exponential families provide to more directly
address prediction problems, where we wish to predict a target Y ∈ Y given covariates X ∈ X .
Here, we almost always have that Y is either discrete or continuous with Y ⊂ R. In this case, we

55
Lexture Notes on Statistics and Information Theory John Duchi

have a sufficient statistic ϕ : X × Y → Rd , and we model Y | X = x via the generalized linear model
(or conditional exponential family model) if it has density or probability mass function
 
pθ (y | x) = exp ϕ(x, y)⊤ θ − A(θ | x) h(y), (3.4.1)

where as before h is the carrier and (in the case that Y ⊂ Rk )


Z
A(θ | x) = log exp(ϕ(x, y)⊤ θ)h(y)dy

or, in the discrete case, X


A(θ | x) = log exp(ϕ(x, y)⊤ θ)h(y).
y

The log partition function A(· | x) provides the same insights for the conditional models (3.4.1)
as it does for the unconditional exponential family models in the preceding sections. Indeed, as
in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C ∞ on its domain and convex.
Moreover, it gives the expected moments of the sufficient statistic ϕ conditional on x, as

∇A(θ | x) = Eθ [ϕ(X, Y ) | X = x],

and
∇2 A(θ | x) = Covθ (ϕ(X, Y ) | X = x),
from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear
regression, binary logistic regression, and multiclass logistic regression. We give these three, as
well as describing two more important examples involving modeling count data through Poisson
regression and making predictions for targets y known to live in a bounded set.

Example 3.4.1 (Linear regression): In linear regression, we wish to predict Y ∈ R from a


vector X ∈ Rd , and assume that Y | X = x follow the normal distribution N(θ⊤ x, σ 2 ). In this
case, we have
 
1 1 ⊤ 2
pθ (y | x) = √ exp − 2 (y − x θ)
2πσ 2 2σ
   
1 ⊤ 1 ⊤ ⊤ 1 2 1 2
= exp yx θ − 2 θ xx θ exp − 2 y + log(2πσ ) ,
σ2 2σ 2σ 2

so that we have the exponential family representation (3.4.1) with ϕ(x, y) = σ12 xy, h(y) =
exp(− 2σ1 2 y 2 + 21 log(2πσ 2 )), and A(θ) = 2σ1 2 θ⊤ xx⊤ θ. As ∇A(θ | x) = Eθ [ϕ(X, Y ) | X = x] =
1
σ2
xEθ [Y | X = x], we easily recover Eθ [Y | X = x] = θ⊤ x. 3

Frequently, we wish to predict binary or multiclass random variables Y . For example, consider
a medical application in which we wish to assess the probability that, based on a set of covariates
x ∈ Rd (say, blood pressure, height, weight, family history) and individual will have a heart attack
in the next 5 years, so that Y = 1 indicates heart attack and Y = −1 indicates not. The next
example shows how we might model this.

56
Lexture Notes on Statistics and Information Theory John Duchi

Example 3.4.2 (Binary logistic regression): If Y ∈ {−1, 1}, we model


exp(yx⊤ θ)
pθ (y | x) = ,
1 + exp(yx⊤ θ)
where the idea in the probability above is that if x⊤ θ has the same sign as y, then the large
x⊤ θy becomes the higher the probability assigned the label y; when x⊤ θy < 0, the probability
is small. Of course, we always have pθ (y | x) + pθ (−y | x) = 1, and using the identity
y+1 ⊤
yx⊤ θ − log(1 + exp(yx⊤ θ)) = x θ − log(1 + exp(x⊤ θ))
2
we obtain the generalized linear model representation ϕ(x, y) = y+12 x and A(θ | x) = log(1 +

exp(x θ)).
As an alternative, we could represent Y ∈ {0, 1} by
exp(yx⊤ θ) 
⊤ x⊤ θ

pθ (y | x) = = exp yx θ − log(1 + e ) ,
1 + exp(x⊤ θ)
which has the simpler sufficient statistic ϕ(x, y) = xy. 3
Instead of a binary prediction problem, in many cases we have a multiclass prediction problem,
where we seek to predict a label Y for an object x belonging to one of k different classes. For
example, in image recognition, we are given an image x and wish to identify the subject Y of the
image, where Y ranges over k classes, such as birds, dogs, cars, trucks, and so on. This too we can
model using exponential families.
Example 3.4.3 (Multiclass logistic regression): In the case that we have a k-class prediction
problem in which we wish to predict Y ∈ {1, . . . , k} from X ∈ Rd , we assign parameters
θy ∈ Rd to each of the classes y = 1, . . . , k. We then model
 
k
exp(θy⊤ x)
X 

pθ (y | x) = Pk = exp θy⊤ x − log eθj x  .
exp(θ ⊤ x)
j=1 j j=1

Here, the idea is that if θy⊤ x > θj⊤ x for all j ̸= y, then the model assigns higher probability to
class y than any other class; the larger the gap between θy⊤ x and θj⊤ x, the larger the difference
in assigned probabilities. 3
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When Y ∈ N is a count, the Poisson distribution with
−λ y ⊤
rate λ > 0 gives P (Y = y) = e y!λ . Poisson regression models λ via eθ x , giving model
1  ⊤

pθ (y | x) = exp yx⊤ θ − eθ x ,
y!
so that we have carrier h(y) = 1/y! and the simple sufficient statistic yx⊤ θ. The log partition

function is A(θ | x) = eθ x . 3

57
Lexture Notes on Statistics and Information Theory John Duchi

Lastly, we consider a less standard example, but which highlights the flexibility of these models.
Here, we assume a linear regression problem but in which we wish to predict values Y in a bounded
range.

Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish
to model it via an exponential family model with density

pθ (y | x) = exp(yx⊤ θ − A(θ | x))1 {y ∈ [−b, b]} ,

which is non-zero only for −b ≤ y ≤ b. Letting s = x⊤ θ for shorthand, we have


Z b
1 h bs i
eys dy = e − e−bs ,
−b s

where the limit as s → 0 is 2b; the (conditional) log partition function is thus
bθ ⊤ x −e−bθ ⊤ x
(
log e θ⊤ x
if θ⊤ x ̸= 0
A(θ | x) =
log(2b) otherwise.

While its functional form makes this highly non-obvious, our general results guarantee that
A(θ | x) is indeed C ∞ and convex in θ. We have ∇A(θ | x) = xEθ [Y | X = x] because
ϕ(x, y) = xy, and we can therefore immediately recover Eθ [Y | X = x]. Indeed, set s = θ⊤ x,
and without loss of generality assume s ̸= 0. Then

∂ ebs − e−bs b(ebs + e−bs ) 1


E[Y | x⊤ θ = s] = log = bs − ,
∂s s e − e−bs s

which increases from −b to b as s = x⊤ θ increases from −∞ to +∞. 3

3.4.1 Fitting a generalized linear model from a sample


We briefly revisit the approach in Section 3.2.1 for fitting exponential family models in the context
of generalized linear models. In this case, the analogue of the maximum likelihood problem (3.2.2)
is to solve
Xn n h
X i
minimize − log pθ (Yi | Xi ) = −ϕ(Xi , Yi )⊤ θ + A(θ | Xi ) .
θ
i=1 i=1

This is a convex optimization problem with C ∞ objective, so we can treat solving it as an (essen-
tially) trivial problem unless the sample size n or dimension d of θ are astronomically large.
As in the moment matching equality (3.2.3), a necessary and sufficient condition for θbn to
minimize the above objective is that it achieves 0 gradient, that is,
n n
1X 1X
∇A(θbn | Xi ) = ϕ(Xi , Yi ).
n n
i=1 i=1

Once again, to find θbn amounts to matching moments, as ∇A(θ | Xi ) = E[ϕ(X, Y ) | X = Xi ], and
we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless,

58
Lexture Notes on Statistics and Information Theory John Duchi

we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In
particular, for the logarithmic loss ℓ(θ; x, y) = − log pθ (y | x), we can define the empirical loss
n
1X
Ln (θ) := ℓ(θ; Xi , Yi ).
n
i=1

Then, as n → ∞, we expect that Ln (θ) → E[ℓ(θ; X, Y )], so that the minimizing θ should give the
best predictions possible according to the loss ℓ. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like Ln ) from their population
counterparts.

3.4.2 The information in a generalized linear model


As we did in Section 3.3, we can compute the “information” about a parameter θ in a generalized
linear model as well. In this case, Pθ specifies only the conditional distribution of Y given X = x,
so when we compute the information, we assume X follows some marginal distribution Q. In this
case,
R (X, Y ) ∼ Pθ ◦ Q, where we have abused composition notation to mean that P(X, Y ∈ A) =
(x,y)∈A pθ (y | x)q(x)dydx. In this case, we have
 
pθ (Y | X) h i
Dkl (Pθ ◦ Q||Pθ+∆ ◦ Q) = Eθ log = Eθ ∆⊤ ϕ(X, Y ) − A(θ | X) + A(θ + ∆ | X)
pθ+∆ (Y | X)
 
= Eθ DA(·|X) (θ + ∆, θ) ,

and similarly
 
Dkl (Pθ+∆ ◦ Q||Pθ ◦ Q) = Eθ+∆ DA(·|X) (θ, θ + ∆) ,

where we recall the Bregman divergence (3.3.1) and have used that Eθ [ϕ(X, Y ) | X] = ∇A(θ | X).
Performing a Taylor expansion, we have
1
A(θ + ∆ | x) = A(θ | x) + ⟨∇A(θ | x), ∆⟩ + ∆⊤ ∇2 A(θ | x)∆ + O(Eθ+t∆ [∥ϕ(X, Y )∥3 | x] · ∥∆∥3 ),
2
where we have computed third derivatives of A(θ | x), and t ∈ [0, 1]. Evaluating the Taylor
expansion in the integral form, once there exists some δ > 0 such that
Z 1 h i
3
EQ Eθ+tv [∥ϕ(X, Y )∥ | X] dt < ∞
0

for any ∥v∥2 ≤ δ, for either of the expansions above we have the following corollary:

Corollary 3.4.6. Assume the marginal distribution Q on X satisfies the above integrability con-
dition. Then as ∆ → 0,
1
Dkl (Pθ ◦ Q||Pθ+∆ ◦ Q) = ∆⊤ EQ [∇2 A(θ | X)]∆ + O(∥∆∥3 )
2
and
1
Dkl (Pθ+∆ ◦ Q||Pθ ◦ Q) = ∆⊤ EQ [∇2 A(θ | X)]∆ + O(∥∆∥3 ).
2

59
Lexture Notes on Statistics and Information Theory John Duchi

In analogy with Proposition 3.3.1, we see again that the expected Hessian EQ [∇2 A(θ | X)] tells
quite precisely how the KL divergence changes as θ varies locally, but now, the distribution Q on
X also enters the picture. So when A and the distribution Q are such that EQ [∇2 A(θ | X)] is large
in the semidefinite order, then it is easy to distinguish data coming from Pθ ◦ Q from that drawn
from Pθ′ ◦ Q, and otherwise, it is not. We therefore call

E[∇2 A(θ | X)] = E[Covθ (ϕ(X, Y ) | X)] = E[∇ log pθ (Y | X)∇ log pθ (Y | X)] (3.4.2)

the Fisher information of the parameter θ in the model Pθ .

Example 3.4.7 (The information in logistic regression): For the binary logistic regression
model (Example 3.4.2) with Y ∈ {0, 1}, we have

ex θ
∇ log pθ (y | x) = yx − x = (y − pθ (1 | x))x
1 + ex⊤ θ

and ∇2 A(θ | X) = pθ (1 | x)(1 − pθ (1 | x))xx⊤ . Thus for X ∼ Q we the Fisher information is


1
EQ [Varθ (Y | X)XX ⊤ ] ⪯ EQ [XX ⊤ ].
4
When θ = 0, we have identically Varθ (Y | X) = 41 , which is the “maximal” information.
Additionally, we see that when X ∼ Q has larger covariance, we expect XX ⊤ to be larger in
the semidefinite order, meaning the observations (X, Y ) contain more information about θ (of
course, this is mitigated by the fact that pθ (y | x) becomes more extreme as ∥x∥ grows). 3

Example 3.4.8 (The KL-divergence in logistic regression): The binary logistic regression
model with Y ∈ {0, 1} also admits simple bounds on its KL-divergence. For these, we first
make the simple observation that for the log-sum-exp function f (t) = log(1 + et ), we have
et
f ′ (t) = 1+e ′′ ′ ′
t and f (t) = f (t)(1 − f (t)). Taylor’s theorem states that

Z ∆

f (t + ∆) = f (t) + f (t)∆ + f ′′ (t + u)(∆ − u)du,
0

∆2
R∆ R∆
and as 0 ≤ f ′′ ≤ 14 , we have | 0 f ′′ (t + u)(∆ − u)du| ≤ 1
4 0 |∆ − u|du = 8 , so

∆2
|f (t + ∆) − f (t) − f ′ (t)∆| ≤
8
for all ∆ ∈ R. Computing the KL-divergence directly, we thus have for any parameters θ0 , θ1
that
1 ⊤ 2
Dkl (Pθ0 (· | x)||Pθ1 (· | x)) = f (θ0⊤ x) − f (θ1⊤ x) − f ′ (θ1⊤ x)x⊤ (θ0 − θ1 ) ≤ x (θ0 − θ1 ) ,
8
so the divergence is at most quadratic. 3

60
Lexture Notes on Statistics and Information Theory John Duchi

3.5 Lower bounds on testing a parameter’s value


We give a bit of a preview here of the tools we will develop to prove fundamental limits in Part II of
the book, an hors d’oeuvres that points to the techniques we develop. In Section 2.3.1, we presented
Le Cam’s method and used it in Example 2.3.2 to give a lower bound on the probability of error in
a hypothesis test comparing two normal means. This approach extends beyond this simple case,
and here we give another example applying it to exponential family models.
We give a stylized version of the problem. Let {Pθ } be an exponential family model with
parameter θ ∈ Rd . Suppose for some vector v ∈ Rd , we wish to test whether v ⊤ θ > 0 or v ⊤ θ < 0 in
the model. For example, in the regression settings in Section 3.4, we may be interested in the effect
of a treatment on health outcomes. Then the covariates x contain information about an individual
with first index x1 corresponding to whether the individual is treated or not, while Y measures the
outcome of treatment; setting v = e1 , we then wish to test whether there is a positive treatment
effect θ1 = e⊤
1 θ > 0 or negative.
Abstracting away the specifics of the scenario, we ask the following question: given an exponen-
tial family {Pθ } and a threshold t of interest, at what separation δ > 0 does it become essentially
impossible to test
v ⊤ θ ≤ t versus v ⊤ θ ≥ t + δ?
We give one approach to this using two-point hypothesis testing lower bounds. In this case, we
consider testing sequences of two alternatives

H0 : θ = θ0 versus H1,n : θ = θn

as n grows, where we observe a sample X1n drawn i.i.d. either according to Pθ0 (i.e., H0 ) or Pθn
(i.e., H1,n ). By choosing θn in a way that makes the separation v ⊤ (θn − θ0 ) large but testing H0
against H1,n challenging, we can then (roughly) identify the separation δ at which testing becomes
impossible.

Proposition 3.5.1. Let θ0 ∈ Rd . Then there exists a sequence of parameters θn with ∥θn − θ0 ∥ =

O(1 n), separation
1
q
v ⊤ (θn − θ0 ) = √ v ⊤ ∇2 A(θ0 )−1 v,
n
and for which
1
inf {Pθ0 (Ψ(X1n ) ̸= 0) + Pθn (Ψ(X1n ) ̸= 1)} ≥ + O(n−1/2 ).
Ψ 2
Proof Let ∆ ∈ Rd be a potential perturbation to θ1 = θ0 + ∆, which gives separation δ =
v ⊤ θ1 − v ⊤ θ0 = v ⊤ ∆. Let P0 = Pθ0 and P1 = Pθ1 . Then the smallest summed probability of error
in testing between P0 and P1 based on n observations X1n is

inf {P0 (Ψ(X1 , . . . , Xn ) ̸= 0) + P1 (Ψ(X1 , . . . , Xn ) ̸= 1)} = 1 − ∥P0n − P1n ∥TV


Ψ

by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker’s inequal-
ity (2.2.10) and use that the KL-divergence tensorizes to find

2 ∥P0n − P1n ∥2TV ≤ nDkl (P0 ||P1 ) = nDkl (Pθ0 ||Pθ0 +∆ ) = nDA (θ0 + ∆, θ0 ),

where the final equality follows from the equivalence between KL and Bregman divergences for
exponential families (Proposition 3.3.1).

61
Lexture Notes on Statistics and Information Theory John Duchi

To guarantee that the summed probability of error is at least 21 , that is, ∥P0n − P1n ∥TV ≤ 12 ,
it suffices to choose ∆ satisfying nDA (θ0 + ∆, θ0 ) ≤ 21 . So to maximize the separation v ⊤ ∆ while
guaranteeing a constant probability of error, we (approximately) solve
maximize v ⊤ ∆
1
subject to DA (θ0 + ∆, θ0 ) ≤ 2n .
3
Now, consider that DA (θ0 + ∆, θ0 ) = 12 ∆⊤ ∇2 A(θ0 )∆ + O(∥∆∥ ). Ignoring the higher order term,
we consider maximizing v ⊤ ∆ subject to ∆⊤ ∇2 A(θ0 )∆ ≤ n1 . A Lagrangian calculation shows that
this has solution
1 1
∆= √ p ∇2 A(θ0 )−1 v.
n v ⊤ ∇2 A(θ0 )−1 v
p
With this choice, we have separation δ = v ⊤ ∆ = v ⊤ ∇2 A(θ0 )−1 v/n, and DA (θ0 + ∆, θ0 ) =
1 3/2
2n + O(1/n ). The summed probability of error is at least
r r
n 1 1
n n
1 − ∥P0 − P1 ∥TV ≥ 1 − + O(n −1/2 )=1− + O(n−1/2 ) = + O(n−1/2 )
4n 4 2
as desired.

Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Sec-
tion 3.2.1. For an unknown parameter θ in the exponential family model Pθ , we observe X1 , . . . , Xn ,
and wish to test whether v ⊤ θ ≥ t for a given threshold t. Call our null H0 : v ⊤ θ ≤ t, and assume
we wish to test at an asymptotic level α > 0, meaning the probability the test falsely rejects H0 is
(as n → ∞) is at most α. Assuming the heuristic (3.2.4), we have the approximate distributional
equality  
· 1
v ⊤ θbn ∼ N v ⊤ θ, v ⊤ ∇2 A(θbn )−1 v .
n
Note that we have θbn on the right side of the distribution; it is possible to make this rigorous, but
here we target only intuition building. A natural asymptotically level α test is then
( q
Reject if v ⊤ θbn ≥ t + z1−α v ⊤ ∇2 A(θbn )−1 v/n
Tn :=
Accept otherwise,
where z1−α is the 1 − α quantile of a standard normal, P(Z ≥ z1−α ) = α for Z ∼ N(0, 1). Let θ0
be such that v ⊤ θ0 = t, so H0 holds. Then

 q 
⊤ b ⊤ 2 −1
Pθ0 (Tn rejects) = Pθ0 n · v (θn − θ0 ) ≥ z1−α v ∇ A(θn ) v → α.
b
p √
At least heuristically, then, this separation δ = v ⊤ A(θ0 )−1 v/ n is the fundamental separation
in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence Dkl (Pθ0 +∆ ||Pθ0 ) =
1 ⊤ 2 3
2 ∆ ∇ A(θ0 )∆ + O(∥∆∥ ) near θ0 plays the fundamental role in both the lower bound and upper
bound on testing. When the Hessian ∇2 A(θ0 ) is “large,” meaning it is very positive definite,
distributions with small parameter distances are still well-separated in KL-divergence, making
testing easy, while when ∇2 A(θ0 ) is small (nearly indefinite), the KL-divergence can be small even
for large parameter separations ∆ and testing is hard. As a consequence, at least for exponential
family models, the Fisher information (3.3.2), which we defined as ∇2 A(θ) = Covθ (ϕ(X)), plays a
central role in testing and, as we see later, estimation.

62
Lexture Notes on Statistics and Information Theory John Duchi

3.6 Deferred proofs


We collect proofs that rely on background we do not assume for this book here.

3.6.1 Proof of Proposition 3.2.2


We follow Brown [41]. We demonstrate only the first-order differentiability using Lebesgue’s domi-
nated convergence theorem , as higher orders and the interchange of integration and differentiation
are essentially identical. Demonstrating first-order complex differentiability is of course enough to
show that A is analytic.1 As the proof of Proposition 3.2.1 does not rely on analyticity of A, we
may use its results. Thus, let Θ = dom A(·) in Rd , which is convex. We assume Θ has non-empty
interior (if the interior is empty, then the convexity of Θ means that it must lie in a lower dimen-
sional subspace; we simply take the interior relative to that subspace and may proceed). We claim
the following lemma, which is the key to applying dominated convergence; we state it first for Rd .

Lemma 3.6.1. Consider any collection {θ1 , . . . , θm } ⊂ Θ, and let Θ0 = Conv{θi }m i=1 and C ⊂
int Θ0 . Then for any k ∈ N, there exists a constant K = K(C, k, {θi }) such that for all θ0 ∈ C,

∥x∥k exp(⟨θ0 , x⟩) ≤ K max exp(⟨θj , x⟩).


j≤m

Proof   Let B = {u ∈ Rd | ∥u∥ ≤ 1} be the unit ball in Rd. For any ϵ > 0, there exists a K = K(ϵ)
such that ∥x∥k ≤ K e^{ϵ∥x∥} for all x ∈ Rd. As C ⊂ int Conv(Θ0), there exists an ϵ > 0 such that for
all θ0 ∈ C, θ0 + 2ϵB ⊂ Θ0, and by construction, for any u ∈ B we can write θ0 + 2ϵu = \sum_{j=1}^m \lambda_j\theta_j
for some λ ∈ R^m_+ with 1⊤λ = 1. We therefore have
\[
  \|x\|^k \exp(\langle\theta_0, x\rangle) \le \|x\|^k \sup_{u\in B} \exp(\langle\theta_0 + \epsilon u, x\rangle)
  = \|x\|^k \exp(\epsilon\|x\|)\exp(\langle\theta_0, x\rangle) \le K \exp(2\epsilon\|x\|)\exp(\langle\theta_0, x\rangle)
  = K \sup_{u\in B} \exp(\langle\theta_0 + 2\epsilon u, x\rangle).
\]

But using the convexity of t ↦ exp(t) and that θ0 + 2ϵu ∈ Θ0, the last quantity has upper bound
\[
  \sup_{u\in B} \exp(\langle\theta_0 + 2\epsilon u, x\rangle) \le \max_{j\le m} \exp(\langle\theta_j, x\rangle).
\]

This gives the desired claim.

A similar result is possible with differences of exponentials:

Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C,
\[
  \frac{|e^{\langle\theta,x\rangle} - e^{\langle\theta_0,x\rangle}|}{\|\theta - \theta_0\|} \le K \max_{j\le m} e^{\langle\theta_j,x\rangle}.
\]

Proof   We write
\[
  \frac{\exp(\langle\theta, x\rangle) - \exp(\langle\theta_0, x\rangle)}{\|\theta - \theta_0\|}
  = \exp(\langle\theta_0, x\rangle)\,\frac{\exp(\langle\theta - \theta_0, x\rangle) - 1}{\|\theta - \theta_0\|}
\]
¹For complex functions, Osgood’s lemma shows that if A is continuous and holomorphic in each variable individually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g., [101, Ch. 1].


so that the lemma is equivalent to showing that
\[
  \frac{|e^{\langle\theta-\theta_0,x\rangle} - 1|}{\|\theta - \theta_0\|} \le K \max_{j\le m} \exp(\langle\theta_j - \theta_0, x\rangle).
\]

From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that
by convexity e^{−a} ≥ 1 − a for all a ∈ R, so 1 − e^a ≤ |a| when a ≤ 0. Conversely, if a > 0, then
ae^a ≥ e^a − 1 (note that \frac{d}{da}(ae^a) = ae^a + e^a ≥ e^a), so dividing by ∥x∥, we see that
\[
  \frac{|e^{\langle\theta,x\rangle} - 1|}{\|\theta\|\,\|x\|}
  \le \frac{|e^{\langle\theta,x\rangle} - 1|}{|\langle\theta,x\rangle|}
  \le \frac{\max\{\langle\theta,x\rangle e^{\langle\theta,x\rangle}, |\langle\theta,x\rangle|\}}{|\langle\theta,x\rangle|}
  \le e^{\langle\theta,x\rangle} + 1.
\]
As θ ∈ C, Lemma 3.6.1 then implies that
\[
  \frac{|e^{\langle\theta,x\rangle} - 1|}{\|\theta\|} \le \|x\|\big( e^{\langle\theta,x\rangle} + 1 \big) \le K \max_j e^{\langle\theta_j,x\rangle},
\]

as desired.

With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed,
fix θ0 ∈ int Θ and for θ ∈ Θ, define
\[
  g(\theta, x) = \frac{\exp(\langle\theta,x\rangle) - \exp(\langle\theta_0,x\rangle) - \exp(\langle\theta_0,x\rangle)\langle x, \theta-\theta_0\rangle}{\|\theta-\theta_0\|}
  = \frac{e^{\langle\theta,x\rangle} - e^{\langle\theta_0,x\rangle} - \langle\nabla e^{\langle\theta_0,x\rangle}, \theta-\theta_0\rangle}{\|\theta-\theta_0\|}.
\]
Then limθ→θ0 g(θ, x) = 0 by the differentiability of t ↦ e^t. Lemmas 3.6.1 and 3.6.2 show that if
we take any collection {θj}_{j=1}^m ⊂ Θ for which θ0 ∈ int Conv{θj}, then for C ⊂ int Conv{θj}, there
exists a constant K such that
\[
  |g(\theta, x)| \le \frac{|\exp(\langle\theta,x\rangle) - \exp(\langle\theta_0,x\rangle)|}{\|\theta - \theta_0\|} + \|x\|\exp(\langle\theta_0,x\rangle) \le K \max_j \exp(\langle\theta_j,x\rangle)
\]
for all θ ∈ C. As \int \max_j e^{\langle\theta_j,x\rangle}\,d\mu(x) \le \sum_{j=1}^m \int e^{\langle\theta_j,x\rangle}\,d\mu(x) < \infty, the dominated convergence
theorem thus implies that
\[
  \lim_{\theta\to\theta_0} \int g(\theta, x)\,d\mu(x) = 0,
\]

and so M(θ) = exp(A(θ)) is differentiable in θ, as
\[
  M(\theta) = M(\theta_0) + \Big\langle \int x e^{\langle\theta_0,x\rangle}\,d\mu(x), \; \theta - \theta_0 \Big\rangle + o(\|\theta - \theta_0\|).
\]
It is evident that we have the derivative
\[
  \nabla M(\theta) = \int \nabla_\theta \exp(\langle\theta, x\rangle)\,d\mu(x).
\]


Analyticity   Over the subset ΘC := {θ + iz | θ ∈ Θ, z ∈ Rd} (where i = √−1 is the imaginary
unit), we can extend the preceding results to demonstrate that A is analytic on ΘC. Indeed, we
first simply note that for a, b ∈ R, exp(a + ib) = exp(a) exp(ib) and |exp(a + ib)| = exp(a), i.e.
|e^z| = e^{{\rm Re}\,z} for z ∈ C, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case.
These are enough for the application of the dominated convergence theorem above, and we use that
exp(·) is analytic to conclude that θ ↦ M(θ) is analytic on ΘC.


3.7 Bibliography

3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic ϕ and an explicit formula for the log
partition function A(θ, Θ) so that we can write pθ,Θ (x) = exp(⟨θ, ϕ1 (x)⟩ + ⟨Θ, ϕ2 (x)⟩ − A(θ, Θ)).
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let ℓ(θ; x, y) =
− log pθ (y | x) be the associated log loss.

(i) Give the Hessian ∇2θ ℓ(θ; x, y).

(ii) Let (xi , yi )ni=1 ⊂ Rd × {±1} be a sample. Give a sufficient condition for the minimizer of the
empirical log loss
\[
  L_n(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(\theta; x_i, y_i)
\]

to be unique that depends only on the vectors {xi }. Hint. A convex function h is strictly
convex if and only if its Hessian ∇2 h is positive definite.

Exercise 3.3: Give the Fisher information (3.4.2) for each of the following generalized linear
models:

(a) Linear regression (Example 3.4.1).

(b) Poisson regression (Example 3.4.4).

Part I

Concentration, information, stability,


and generalization

Chapter 4

Concentration Inequalities

In many scenarios, it is useful to understand how a random variable X behaves by giving bounds
on the probability that it deviates far from its mean or median. This can allow us to prove
that estimation and learning procedures will have certain performance, that different decoding and
encoding schemes work with high probability, and other results. In this chapter, we give several
tools for proving bounds on the probability that random variables are far from their typical values.
We conclude the chapter with a discussion of basic uniform laws of large numbers and applications
to empirical risk minimization and statistical learning, though we focus on the relatively simple
cases we can treat with our tools.

4.1 Basic tail inequalities


In this first section, we have a simple-to-state goal: given a random variable X, how does X
concentrate around its mean? That is, assuming w.l.o.g. that E[X] = 0, how well can we bound

P(X ≥ t)?

We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev,
and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.

Proposition 4.1.1 (Markov’s inequality). Let X be a nonnegative random variable, meaning that
X ≥ 0 with probability 1. Then
\[
  P(X \ge t) \le \frac{E[X]}{t}.
\]
Proof For any random variable, P(X ≥ t) = E[1 {X ≥ t}] ≤ E[(X/t)1 {X ≥ t}] ≤ E[X]/t, as
X/t ≥ 1 whenever X ≥ t.

When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.


Proposition 4.1.2 (Chebyshev’s inequality). Let X be a random variable with Var(X) < ∞. Then

\[
  P(X - E[X] \ge t) \le \frac{{\rm Var}(X)}{t^2} \quad \text{and} \quad P(X - E[X] \le -t) \le \frac{{\rm Var}(X)}{t^2}
\]
for all t ≥ 0.

Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).

If a random variable has a moment generating function—exponential moments—we can give


bounds that enjoy very nice properties when combined with sums of random variables. First, we
recall that
φX (λ) := E[eλX ]
is the moment generating function of the random variable X. Then we have the Chernoff bound.

Proposition 4.1.3. For any random variable X, we have

\[
  P(X \ge t) \le \frac{E[e^{\lambda X}]}{e^{\lambda t}} = \varphi_X(\lambda) e^{-\lambda t}
\]
for all λ ≥ 0.

Proof This is another application of Markov’s inequality: for λ > 0, we have eλX ≥ eλt if and
only if X ≥ t, so that P(X ≥ t) = P(eλX ≥ eλt ) ≤ E[eλX ]/eλt .

In particular, taking the infimum over all λ ≥ 0 in Proposition 4.1.3 gives the more standard
Chernoff (large deviation) bound
\[
  P(X \ge t) \le \exp\Big( \inf_{\lambda\ge 0}\big\{ \log\varphi_X(\lambda) - \lambda t \big\} \Big).
\]

Example 4.1.4 (Gaussian random variables): When X is a mean-zero Gaussian variable
with variance σ2, we have
\[
  \varphi_X(\lambda) = E[\exp(\lambda X)] = \exp\Big( \frac{\lambda^2\sigma^2}{2} \Big). \tag{4.1.1}
\]

To see this, we compute the integral; we have
\[
  E[\exp(\lambda X)] = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( \lambda x - \frac{1}{2\sigma^2}x^2 \Big)\,dx
  = e^{\frac{\lambda^2\sigma^2}{2}} \underbrace{\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{1}{2\sigma^2}(x - \lambda\sigma^2)^2 \Big)\,dx}_{=1},
\]
because this is simply the integral of the Gaussian density.


As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3), we see that for X Gaussian with variance σ2, we have
\[
  P(X \ge E[X] + t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big) \quad \text{and} \quad P(X \le E[X] - t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big)
\]
for all t ≥ 0. Indeed, we have \log\varphi_{X-E[X]}(\lambda) = \frac{\lambda^2\sigma^2}{2}, and \inf_\lambda\{\frac{\lambda^2\sigma^2}{2} - \lambda t\} = -\frac{t^2}{2\sigma^2}, which is
attained by \lambda = \frac{t}{\sigma^2}. 3
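As a quick numerical sanity check (my own, not from the text), the following sketch compares the exact Gaussian tail with the Chernoff bound just derived; the values of t are illustrative.

```python
# A quick check (mine): the Chernoff bound exp(-t^2/(2 sigma^2)) against the
# exact Gaussian tail P(X >= t) for X ~ N(0, sigma^2).
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [0.5, 1.0, 2.0, 3.0]:
    exact = norm.sf(t, scale=sigma)
    bound = np.exp(-t**2 / (2 * sigma**2))
    print(f"t={t:3.1f}  exact tail={exact:.4e}  Chernoff bound={bound:.4e}")
```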

4.1.1 Sub-Gaussian random variables


Gaussian random variables are convenient for their nice analytical properties, but a broader class
of random variables with similar moment generating functions are known as sub-Gaussian random
variables.
Definition 4.1. A random variable X is sub-Gaussian with parameter σ2 if
\[
  E[\exp(\lambda(X - E[X]))] \le \exp\Big( \frac{\lambda^2\sigma^2}{2} \Big)
\]
for all λ ∈ R. We also say such a random variable is σ2-sub-Gaussian.
Of course, Gaussian random variables satisfy Definition 4.1 with equality. This would be uninteresting if only Gaussian random variables satisfied this property; happily, that is not the case,
and we detail several examples.

Example 4.1.5 (Random signs (Rademacher variables)): The random variable X taking
values {−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have
\[
  E[\exp(\lambda X)] = \frac{1}{2}e^\lambda + \frac{1}{2}e^{-\lambda}
  = \frac{1}{2}\sum_{k=0}^\infty \frac{\lambda^k}{k!} + \frac{1}{2}\sum_{k=0}^\infty \frac{(-\lambda)^k}{k!}
  = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!}
  \le \sum_{k=0}^\infty \frac{(\lambda^2)^k}{2^k k!} = \exp\Big( \frac{\lambda^2}{2} \Big),
\]

as claimed. 3

Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 4.1.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b].
Then Hoeffding’s lemma states that
\[
  E[e^{\lambda(X - E[X])}] \le \exp\Big( \frac{\lambda^2(b-a)^2}{8} \Big),
\]
so that X is (b − a)2/4-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives
one approach to proving the above statement. First, let ε ∈ {−1, 1} be a Rademacher variable,
so that P(ε = 1) = P(ε = −1) = 1/2. We apply a so-called symmetrization technique—a
common technique in probability theory, statistics, concentration inequalities, and Banach
space research—to give a simpler bound. Indeed, let X′ be an independent copy of X, so that
E[X′] = E[X]. We have
\[
  \varphi_{X-E[X]}(\lambda) = E\big[ \exp(\lambda(X - E[X'])) \big] \le E\big[ \exp(\lambda(X - X')) \big] = E\big[ \exp(\lambda\varepsilon(X - X')) \big],
\]

where the inequality follows from Jensen’s inequality and the last equality is a consequence of
the fact that X − X′ is symmetric about 0. Using the result of Example 4.1.5,
\[
  E\big[ \exp(\lambda\varepsilon(X - X')) \big] \le E\Big[ \exp\Big( \frac{\lambda^2(X - X')^2}{2} \Big) \Big] \le \exp\Big( \frac{\lambda^2(b-a)^2}{2} \Big),
\]
where the final inequality is immediate from the fact that |X − X′| ≤ b − a. 3
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior,
more sophisticated techniques involve explicitly bounding the logarithm of the moment generating
function of X, often by calculations involving exponential tilts of its density. In particular, letting
X be mean zero for simplicity, if we let

ψ(λ) = log φX (λ) = log E[eλX ],

then
\[
  \psi'(\lambda) = \frac{E[Xe^{\lambda X}]}{E[e^{\lambda X}]} \quad\text{and}\quad
  \psi''(\lambda) = \frac{E[X^2 e^{\lambda X}]}{E[e^{\lambda X}]} - \frac{E[Xe^{\lambda X}]^2}{E[e^{\lambda X}]^2},
\]
where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite.
Notably, if X has density pX (with respect to any base measure) then the random variable Yλ with
density
\[
  p_\lambda(y) = p_X(y)\,\frac{e^{\lambda y}}{E[e^{\lambda X}]}
\]
(with respect to the same base measure) satisfies

ψ ′ (λ) = E[Yλ ] and ψ ′′ (λ) = E[Yλ2 ] − E[Yλ ]2 = Var(Yλ ).

One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let X be Bernoulli(p), so that X = 1 with
probability p and X = 0 otherwise. Then a strengthening of Hoeffding’s lemma (also, essentially, due to Hoeffding) is that
\[
  \log E[e^{\lambda(X-p)}] \le \frac{\sigma^2(p)}{2}\lambda^2 \quad \text{for} \quad \sigma^2(p) := \frac{1 - 2p}{2\log\frac{1-p}{p}}.
\]
Here we take the limits as p → {0, 1/2, 1} and have σ2(0) = 0, σ2(1) = 0, and σ2(1/2) = 1/4.
Because p ↦ σ2(p) is concave and symmetric about p = 1/2, this inequality is always sharper
than that of Example 4.1.6. Exercise 4.12 gives one proof of this bound exploiting exponential
tilting. 3
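A small numerical check (mine, not from the text) confirms the claim that σ²(p) never exceeds the Hoeffding value (b − a)²/4 = 1/4; the grid of p values is illustrative.

```python
# A quick check (mine): the Bernoulli sub-Gaussian parameter sigma^2(p) from
# Example 4.1.7 never exceeds the Hoeffding value 1/4.
import numpy as np

def sigma2(p):
    if p in (0.0, 0.5, 1.0):           # limiting values at the special points
        return 0.25 if p == 0.5 else 0.0
    return (1 - 2 * p) / (2 * np.log((1 - p) / p))

for p in [0.01, 0.1, 0.25, 0.5, 0.75, 0.99]:
    print(f"p={p:4.2f}  sigma^2(p)={sigma2(p):.4f}  (Hoeffding: 0.2500)")
```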
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let X be σ2-sub-Gaussian. Then for all t ≥ 0 we have
\[
  P(X - E[X] \ge t) \vee P(X - E[X] \le -t) \le \exp\Big( -\frac{t^2}{2\sigma^2} \Big).
\]


Chernoff bounds extend naturally to sums of independent random variables, because moment
generating functions of sums of independent random variables become products of moment gener-
ating functions.

Proposition 4.1.9. Let X1, X2, . . . , Xn be independent σi2-sub-Gaussian random variables. Then
\[
  E\Big[ \exp\Big( \lambda\sum_{i=1}^n (X_i - E[X_i]) \Big) \Big] \le \exp\Big( \frac{\lambda^2\sum_{i=1}^n \sigma_i^2}{2} \Big) \quad \text{for all } \lambda \in \mathbb{R},
\]
that is, \sum_{i=1}^n X_i is \sum_{i=1}^n \sigma_i^2-sub-Gaussian.

Proof   We assume w.l.o.g. that the Xi are mean zero. We have by independence and
sub-Gaussianity that
\[
  E\Big[ \exp\Big( \lambda\sum_{i=1}^n X_i \Big) \Big] = E\Big[ \exp\Big( \lambda\sum_{i=1}^{n-1} X_i \Big) \Big] E[\exp(\lambda X_n)]
  \le \exp\Big( \frac{\lambda^2\sigma_n^2}{2} \Big) E\Big[ \exp\Big( \lambda\sum_{i=1}^{n-1} X_i \Big) \Big].
\]
Applying this technique inductively to Xn−1, . . . , X1, we obtain the desired result.

Two immediate corollaries to Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random
variables concentrate around their expectations. We begin with a general concentration inequality.

Corollary 4.1.10. Let Xi be independent σi2-sub-Gaussian random variables. Then for all t ≥ 0,
\[
  \max\Big\{ P\Big( \sum_{i=1}^n (X_i - E[X_i]) \ge t \Big), \; P\Big( \sum_{i=1}^n (X_i - E[X_i]) \le -t \Big) \Big\}
  \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2} \Big).
\]

Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with Corollary 4.1.10: if Xi ∈ [ai, bi], then
\[
  P\Big( \sum_{i=1}^n (X_i - E[X_i]) \ge t \Big) \le \exp\Big( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big).
\]

To give another interpretation of these inequalities, let us assume that the Xi are independent and
σ2-sub-Gaussian. Then we have that
\[
  P\Big( \frac{1}{n}\sum_{i=1}^n (X_i - E[X_i]) \ge t \Big) \le \exp\Big( -\frac{nt^2}{2\sigma^2} \Big),
\]
or, for δ ∈ (0, 1), setting \exp(-\frac{nt^2}{2\sigma^2}) = \delta, i.e. t = \sqrt{2\sigma^2\log\frac{1}{\delta}}/\sqrt{n}, we have that
\[
  \frac{1}{n}\sum_{i=1}^n (X_i - E[X_i]) \le \frac{\sqrt{2\sigma^2\log\frac{1}{\delta}}}{\sqrt{n}} \quad \text{with probability at least } 1 - \delta.
\]

There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.


Theorem 4.1.11. Let X be a random variable and σ2 ≥ 0. The following statements are all
equivalent, meaning that there are numerical constant factors Kj such that if one statement (i) holds
with parameter Ki, then statement (j) holds with parameter Kj ≤ CKi, where C is a numerical
constant.

(1) Sub-Gaussian tails: P(|X| \ge t) \le 2\exp(-\frac{t^2}{K_1\sigma^2}) for all t ≥ 0.

(2) Sub-Gaussian moments: E[|X|^k]^{1/k} \le K_2\sigma\sqrt{k} for all k.

(3) Super-exponential moment: E[\exp(X^2/(K_3\sigma^2))] \le e.

If in addition X is mean zero, each of these is equivalent to

(4) Sub-Gaussian moment generating function: E[\exp(\lambda X)] \le \exp(K_4\lambda^2\sigma^2) for all λ ∈ R.

Particularly, (1) implies (2) with K1 = 1 and K2 ≤ e^{1/e}; (2) implies (3) with K2 = 1 and
K3 = e\sqrt{2/(e-1)} < 3; (3) implies (1) with K3 = 1 and K1 = 1/log 2. For the last part, (3) implies
(4) with K3 = 1 and K4 ≤ 3/4, while (4) implies (1) with K4 = 1/2 and K1 ≤ 2.

This result is standard in the literature on concentration and random variables; see Section 4.4.1
for a proof.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the example, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6).

Example 4.1.12 (Sub-Gaussian squares): Let X be a mean-zero σ2-sub-Gaussian random
variable. Then
\[
  E[\exp(\lambda X^2)] \le \frac{1}{[1 - 2\sigma^2\lambda]_+^{1/2}}, \tag{4.1.2}
\]
and expression (4.1.2) holds with equality for X ∼ N(0, σ2).
To see this result, we focus on the Gaussian case first and assume (for this case) without
loss of generality (by scaling) that σ2 = 1. Assuming that λ < 1/2, we have
\[
  E[\exp(\lambda Z^2)] = \int \frac{1}{\sqrt{2\pi}} e^{-(\frac{1}{2}-\lambda)z^2}\,dz
  = \int \frac{1}{\sqrt{2\pi}} e^{-\frac{1-2\lambda}{2}z^2}\,dz = \frac{\sqrt{2\pi}}{\sqrt{1-2\lambda}}\frac{1}{\sqrt{2\pi}} = \frac{1}{\sqrt{1-2\lambda}},
\]
the final equality a consequence of the fact that (as we know for normal random variables)
\int e^{-\frac{1}{2\sigma^2}z^2}\,dz = \sqrt{2\pi\sigma^2}. When λ ≥ 1/2, the above integrals are all infinite, giving the equality in
expression (4.1.2).
For the more general inequality, we recall that if Z is an independent N(0, 1) random
variable, then E[exp(tZ)] = exp(t²/2), and so
\[
  E[\exp(\lambda X^2)] = E[\exp(\sqrt{2\lambda}\,XZ)] \overset{(i)}{\le} E\big[ \exp(\lambda\sigma^2 Z^2) \big] \overset{(ii)}{=} \frac{1}{[1 - 2\sigma^2\lambda]_+^{1/2}},
\]
where inequality (i) follows because X is sub-Gaussian, and (ii) because Z ∼ N(0, 1). 3


4.1.2 Sub-exponential random variables


A slightly weaker condition than sub-Gaussianity is for a random variable to be sub-exponential,
which—for a mean-zero random variable—means that its moment generating function exists in a
neighborhood of zero.
Definition 4.2. A random variable X is sub-exponential with parameters (τ2, b) if for all λ such
that |λ| ≤ 1/b,
\[
  E[e^{\lambda(X - E[X])}] \le \exp\Big( \frac{\lambda^2\tau^2}{2} \Big).
\]
It is clear from Definition 4.2 that a σ 2 -sub-Gaussian random variable is (σ 2 , 0)-sub-exponential.
A variety of random variables are sub-exponential. As a first example, χ2 -random variables are
sub-exponential with constant values for τ and b:
Example 4.1.13: Let X = Z2, where Z ∼ N(0, 1). We claim that
\[
  E[\exp(\lambda(X - E[X]))] \le \exp(2\lambda^2) \quad \text{for } \lambda \le \frac{1}{4}. \tag{4.1.3}
\]
Indeed, for λ < 1/2 we have (recall Example 4.1.12) that
\[
  E[\exp(\lambda(Z^2 - E[Z^2]))] = \exp\Big( -\frac{1}{2}\log(1 - 2\lambda) - \lambda \Big) \overset{(\star)}{\le} \exp\big( \lambda + 2\lambda^2 - \lambda \big),
\]
where inequality (⋆) holds for λ ≤ 1/4, because −log(1 − 2λ) ≤ 2λ + 4λ2 for λ ≤ 1/4. 3
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that X is a
mean zero random variable taking values in [−b, b] with variance σ2 = E[X2] (note that we are
guaranteed that σ2 ≤ b2 in this case). We claim that
\[
  E[\exp(\lambda X)] \le \exp\Big( \frac{3\lambda^2\sigma^2}{5} \Big) \quad \text{for } |\lambda| \le \frac{1}{2b}. \tag{4.1.4}
\]
To see this, we expand e^z via
\[
  e^z = 1 + z + \frac{z^2}{2}\sum_{k=2}^\infty \frac{2z^{k-2}}{k!} = 1 + z + \frac{z^2}{2}\sum_{k=0}^\infty \frac{2}{(k+2)!} z^k.
\]
For k ≥ 0, we have \frac{2}{(k+2)!} \le \frac{1}{3^k}, so that \sum_{k=0}^\infty \frac{2}{(k+2)!} z^k \le \sum_{k=0}^\infty |z/3|^k = [1 - |z|/3]_+^{-1}. Thus
\[
  e^z \le 1 + z + \frac{1}{[1 - |z|/3]_+}\frac{z^2}{2},
\]
and as |X| ≤ b and |λ| < 3/b we therefore obtain
\[
  E[\exp(\lambda X)] \le 1 + E[\lambda X] + E\Big[ \frac{\lambda^2 X^2}{2}\frac{1}{[1 - |\lambda X|/3]_+} \Big] \le 1 + \frac{1}{1 - |\lambda|b/3}\frac{\lambda^2\sigma^2}{2}.
\]


Letting |λ| ≤ \frac{1}{2b} implies \frac{1}{1-|\lambda|b/3} \le \frac{6}{5}, and using that 1 + x ≤ e^x gives the result.
It is possible to give a slightly tighter result for λ ≥ 0. In this case, we have the bound
\[
  E[\exp(\lambda X)] \le 1 + \frac{\lambda^2\sigma^2}{2} + \lambda^2\sigma^2\sum_{k=3}^\infty \frac{\lambda^{k-2}b^{k-2}}{k!} = 1 + \frac{\sigma^2}{b^2}\big( e^{\lambda b} - 1 - \lambda b \big).
\]
Then using that 1 + x ≤ e^x, we obtain Bennett’s moment generating inequality, which is that
\[
  E[e^{\lambda X}] \le \exp\Big( \frac{\sigma^2}{b^2}\big( e^{\lambda b} - 1 - \lambda b \big) \Big) \quad \text{for } \lambda \ge 0. \tag{4.1.5}
\]
Inequality (4.1.5) always holds, and for λb near 0, we have e^{\lambda b} - 1 - \lambda b \approx \frac{\lambda^2 b^2}{2}. 3
In particular, if the variance σ 2 ≪ b2 , the absolute bound on X, inequality (4.1.4) gives much
tighter control on the moment generating function of X than typical sub-Gaussian bounds based
only on the fact that X ∈ [−b, b] allow.
More broadly, we can show a result similar to Theorem 4.1.11.
Theorem 4.1.15. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 4.1.11—
the following statements are all equivalent for suitable numerical constants K1, . . . , K4.

(1) Sub-exponential tails: P(|X| \ge t) \le 2\exp(-\frac{t}{K_1\sigma}) for all t ≥ 0.

(2) Sub-exponential moments: E[|X|^k]^{1/k} \le K_2\sigma k for all k ≥ 1.

(3) Existence of moment generating function: E[\exp(X/(K_3\sigma))] \le e and E[\exp(-X/(K_3\sigma))] \le e.

If in addition X is mean zero, each of these is equivalent to

(4) Sub-exponential moment generating function: E[\exp(\lambda X)] \le \exp(K_4\lambda^2\sigma^2) for |\lambda| \le K_4'/\sigma.

In particular, if (2) holds with K2 = 1, then (4) holds with K4 = 2e^2 and K_4' = \frac{1}{2e}.
See Section 4.4.2 for the proof, which is similar to that for Theorem 4.1.11.
While the concentration properties of sub-exponential random variables are not quite so nice
as those for sub-Gaussian random variables (recall Hoeffding’s inequality, Corollary 4.1.10), we
can give sharp tail bounds for sub-exponential random variables. We first give a simple bound on
deviation probabilities.
Proposition 4.1.16. Let X be a mean-zero (τ2, b)-sub-exponential random variable. Then for all
t ≥ 0,
\[
  P(X \ge t) \vee P(X \le -t) \le \exp\Big( -\frac{1}{2}\min\Big\{ \frac{t^2}{\tau^2}, \frac{t}{b} \Big\} \Big).
\]
Proof The proof is an application of the Chernoff bound technique; we prove only the upper tail
as the lower tail is similar. We have
\[
  P(X \ge t) \le \frac{E[e^{\lambda X}]}{e^{\lambda t}} \overset{(i)}{\le} \exp\Big( \frac{\lambda^2\tau^2}{2} - \lambda t \Big),
\]
inequality (i) holding for |λ| ≤ 1/b. To minimize the last term in λ, we take λ = min{t/τ², 1/b},
which gives the result.

Comparing with sub-Gaussian random variables, which have b = 0, we see that Proposition 4.1.16
gives a similar result for small t—essentially the same concentration as sub-Gaussian random
variables—while for large t, the tails decrease only exponentially in t.
We can also give a tensorization identity similar to Proposition 4.1.9.


Proposition 4.1.17. Let X1, . . . , Xn be independent mean-zero sub-exponential random variables,
where Xi is (σi2, bi)-sub-exponential. Then for any vector a ∈ Rn, we have
\[
  E\Big[ \exp\Big( \lambda\sum_{i=1}^n a_i X_i \Big) \Big] \le \exp\Big( \frac{\lambda^2\sum_{i=1}^n a_i^2\sigma_i^2}{2} \Big) \quad \text{for } |\lambda| \le \frac{1}{b_*},
\]
where b∗ = maxi bi|ai|. That is, ⟨a, X⟩ is (\sum_{i=1}^n a_i^2\sigma_i^2, \max_i b_i|a_i|)-sub-exponential.

Proof   We apply an inductive technique similar to that used in the proof of Proposition 4.1.9.
First, for any fixed i, we know that if |λ| ≤ \frac{1}{b_i|a_i|}, then |a_i\lambda| \le \frac{1}{b_i} and so
\[
  E[\exp(\lambda a_i X_i)] \le \exp\Big( \frac{\lambda^2 a_i^2\sigma_i^2}{2} \Big).
\]
Now, we inductively apply the preceding inequality, which applies so long as |λ| ≤ \frac{1}{b_i|a_i|} for all i.
We have
\[
  E\Big[ \exp\Big( \lambda\sum_{i=1}^n a_i X_i \Big) \Big] = \prod_{i=1}^n E[\exp(\lambda a_i X_i)] \le \prod_{i=1}^n \exp\Big( \frac{\lambda^2 a_i^2\sigma_i^2}{2} \Big),
\]
which is our desired result.

As in the case of sub-Gaussian random variables, a combination of the tensorization property—
that the moment generating functions of sums of sub-exponential random variables are well-behaved—of Proposition 4.1.17 and the concentration inequality of Proposition 4.1.16 immediately yields the
following Bernstein-type inequality. (See also Vershynin [187].)

Corollary 4.1.18. Let X1, . . . , Xn be independent mean-zero (σi2, bi)-sub-exponential random variables (Definition 4.2). Define b∗ := maxi bi. Then for all t ≥ 0 and all vectors a ∈ Rn, we have
\[
  P\Big( \sum_{i=1}^n a_i X_i \ge t \Big) \vee P\Big( \sum_{i=1}^n a_i X_i \le -t \Big)
  \le \exp\Big( -\frac{1}{2}\min\Big\{ \frac{t^2}{\sum_{i=1}^n a_i^2\sigma_i^2}, \frac{t}{b_*\|a\|_\infty} \Big\} \Big).
\]

It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound
is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for σ 2 -sub-Gaussian random
variables) that
\[
  P\Big( \sum_{i=1}^n a_i X_i \ge t \Big) \le \exp\Big( -\frac{t^2}{2\|a\|_2^2\,\sigma^2} \Big),
\]
so that for small t, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large t, the bound is
weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian
bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b]
with probability 1, but have variance σ 2 = E[Xi2 ] ≤ b2 as in Example 4.1.14. Then Corollary 4.1.18
implies that
\[
  P\Big( \sum_{i=1}^n a_i X_i \ge t \Big) \le \exp\Big( -\frac{1}{2}\min\Big\{ \frac{5}{6}\frac{t^2}{\sigma^2\|a\|_2^2}, \frac{t}{2b\|a\|_\infty} \Big\} \Big). \tag{4.1.6}
\]


When applied to a standard mean (and with a minor simplification that 5/12 > 1/3) with ai = 1/n,
we obtain the bound that \frac{1}{n}\sum_{i=1}^n X_i \le t with probability at least 1 - \exp(-n\min\{\frac{t^2}{3\sigma^2}, \frac{t}{4b}\}). Written
differently, we take t = \max\{\sigma\sqrt{3\log\frac{1}{\delta}/n}, \; 4b\log\frac{1}{\delta}/n\} to obtain
\[
  \frac{1}{n}\sum_{i=1}^n X_i \le \max\Big\{ \sigma\frac{\sqrt{3\log\frac{1}{\delta}}}{\sqrt{n}}, \; \frac{4b\log\frac{1}{\delta}}{n} \Big\} \quad \text{with probability at least } 1 - \delta.
\]
The sharpest such bound possible via more naive Hoeffding-type bounds is b\sqrt{2\log\frac{1}{\delta}}/\sqrt{n}, which
has substantially worse scaling.
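To make the comparison tangible, the following sketch (mine, with purely illustrative values of n, b, σ, and δ) evaluates the two deviation levels from the display above for variables with small variance relative to their range.

```python
# An illustrative computation (mine): for i.i.d. X_i in [-b, b] with variance
# sigma^2 << b^2, the sub-exponential (Bernstein-type) deviation level beats
# the Hoeffding-type level b * sqrt(2 log(1/delta) / n).
import numpy as np

n, b, sigma, delta = 10000, 1.0, 0.05, 1e-3
bernstein_t = max(sigma * np.sqrt(3 * np.log(1 / delta) / n),
                  4 * b * np.log(1 / delta) / n)
hoeffding_t = b * np.sqrt(2 * np.log(1 / delta) / n)
print(f"Bernstein-type level: {bernstein_t:.5f}")
print(f"Hoeffding-type level: {hoeffding_t:.5f}")
```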
The exercises ask you to work out further variants of these results, including the sub-exponential
behavior of quadratic forms of Gaussian random vectors. As one particular example, Exercises 4.10
and 4.11 work through the details of proving the following corollary.
Corollary 4.1.19. Let Z ∼ N(0, 1). Then for any µ ∈ R, (µ + Z)2 is (4(1 + 2µ2), 4)-sub-exponential,
and more precisely,
\[
  E\Big[ \exp\Big( \lambda\big( (\mu + Z)^2 - (\mu^2 + 1) \big) \Big) \Big]
  \le \exp\Big( \frac{2\lambda^2\mu^2}{1 - 2\lambda} + \frac{\lambda^2}{[1 - 2|\lambda|]_+} \Big).
\]
Additionally, if Z ∼ N(0, I), then for any matrix A and vector b, ∥AZ − b∥22 is sub-exponential with
\[
  E\Big[ \exp\Big( \lambda\big( \|AZ - b\|_2^2 - \|A\|_{\rm Fr}^2 - \|b\|_2^2 \big) \Big) \Big]
  \le \exp\Big( 2\lambda^2\big( \|A\|_{\rm Fr}^2 + 2\|b\|_2^2 \big) \Big) \quad \text{for } |\lambda| \le \frac{1}{4\|A\|_{\rm op}^2}.
\]

Further conditions and examples


There are a number of examples and conditions sufficient for random variables to be sub-exponential.
One common condition, the so-called Bernstein condition, controls the higher moments of a random
variable X by its variance. In this case, we say that X satisfies the b-Bernstein condition if
\[
  |E[(X - \mu)^k]| \le \frac{k!}{2}\sigma^2 b^{k-2} \quad \text{for } k = 3, 4, \ldots, \tag{4.1.7}
\]
where µ = E[X] and σ 2 = Var(X) = E[X 2 ] − µ2 . In this case, the following lemma controls
the moment generating function of X. This result is essentially present in Theorem 4.1.15, but it
provides somewhat tighter control with precise constants.
Lemma 4.1.20. Let X be a random variable satisfying the Bernstein condition (4.1.7). Then
\[
  E\big[ e^{\lambda(X-\mu)} \big] \le \exp\Big( \frac{\lambda^2\sigma^2}{2(1 - b|\lambda|)} \Big) \quad \text{for } |\lambda| \le \frac{1}{b}.
\]
Said differently, a random variable satisfying Condition (4.1.7) is (2σ2, 2b)-sub-exponential.
Proof   Without loss of generality we assume µ = 0. We expand the moment generating function
by noting that
\[
  E[e^{\lambda X}] = 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k E[X^k]}{k!}
  \overset{(i)}{\le} 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{2}\sum_{k=3}^\infty |\lambda b|^{k-2}
  = 1 + \frac{\lambda^2\sigma^2}{2}\frac{1}{[1 - b|\lambda|]_+}
\]


where inequality (i) used the Bernstein condition (4.1.7). Noting that 1+x ≤ ex gives the result.

As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.

Proposition 4.1.21 (Bennett’s inequality). Let Xi be independent mean-zero random variables
with Var(Xi) = σi2 and |Xi| ≤ b. Then for h(t) := (1 + t) log(1 + t) − t and σ2 := \sum_{i=1}^n \sigma_i^2, we have
\[
  P\Big( \sum_{i=1}^n X_i \ge t \Big) \le \exp\Big( -\frac{\sigma^2}{b^2} h\Big( \frac{bt}{\sigma^2} \Big) \Big).
\]

Proof   We assume without loss of generality that E[X] = 0. Using the standard Chernoff bound
argument coupled with inequality (4.1.5), we see that
\[
  P\Big( \sum_{i=1}^n X_i \ge t \Big) \le \exp\Big( \sum_{i=1}^n \frac{\sigma_i^2}{b^2}\big( e^{\lambda b} - 1 - \lambda b \big) - \lambda t \Big).
\]
Letting h(t) = (1 + t) log(1 + t) − t as in the statement of the proposition and σ2 = \sum_{i=1}^n \sigma_i^2, we
minimize over λ ≥ 0, setting λ = \frac{1}{b}\log(1 + \frac{bt}{\sigma^2}). Substituting into our Chernoff bound application
gives the proposition.

A slightly more intuitive writing of Bennett’s inequality is to use averages, in which case for
σ2 = \frac{1}{n}\sum_{i=1}^n \sigma_i^2 the average of the variances,
\[
  P\Big( \frac{1}{n}\sum_{i=1}^n X_i \ge t \Big) \le \exp\Big( -\frac{n\sigma^2}{b^2} h\Big( \frac{bt}{\sigma^2} \Big) \Big).
\]

It is possible to show that
\[
  \frac{n\sigma^2}{b^2} h\Big( \frac{bt}{\sigma^2} \Big) \ge \frac{nt^2}{2\sigma^2 + \frac{2}{3}bt},
\]
which gives rise to the classical Bernstein inequality that
\[
  P\Big( \frac{1}{n}\sum_{i=1}^n X_i \ge t \Big) \le \exp\Big( -\frac{nt^2}{2\sigma^2 + \frac{2}{3}bt} \Big). \tag{4.1.8}
\]

4.1.3 Orlicz norms


Sub-Gaussian and sub-exponential random variables are examples of a broader class of random
variables belonging to what are known as Orlicz-spaces. For these, we take any convex function
ψ : R+ → R+ with ψ(0) = 0 and ψ(t) → ∞ as t ↑ ∞, a class called the Orlicz functions. Then the
Orlicz norm of a random variable X is

∥X∥ψ := inf {t > 0 | E[ψ(|X|/t)] ≤ 1} . (4.1.9)

That this is a norm is not completely trivial, though a few properties are immediate: clearly
∥aX∥ψ = |a| ∥X∥ψ , and we have ∥X∥ψ = 0 if and only if X = 0 with probability 1. The key result
is that in fact, ∥·∥ψ is actually convex, which then guarantees that it is a norm.


Proposition 4.1.22. The function ∥·∥ψ is convex on the space of random variables.

Proof Because ψ is convex and non-decreasing, x 7→ ψ(|x|) is convex as well. (Convince yourself
of this.) Thus, its perspective transform pers(ψ)(t, |x|) := tψ(|x|/t) is jointly convex in both t ≥ 0
and x (see Appendix B.3.3). This joint convexity of pers(ψ) implies that for any random variables
X0 and X1 and t0 , t1 ,

E[pers(ψ)(λt0 + (1 − λ)t1 , |λX0 + (1 − λ)X1 |)] ≤ λE[pers(ψ)(t0 , |X0 |)] + (1 − λ)E[pers(ψ)(t1 , |X1 |)].

Now note that E[ψ(|X|/t)] ≤ 1 if and only if tE[ψ(|X|/t)] ≤ t.

Because ∥·∥ψ is convex and positively homogeneous, we certainly have

∥X + Y ∥ψ = 2 ∥(X + Y )/2∥ψ ≤ ∥X∥ψ + ∥Y ∥ψ ,

that is, the triangle inequality holds. This implies that centering a variable can never increase its
norm by much:
∥X − E[X]∥ψ ≤ ∥X∥ψ + ∥E[X]∥ψ ≤ ∥X∥ψ + ∥X∥ψ
by Jensen’s inequality, so that ∥X − E[X]∥ψ ≤ 2 ∥X∥ψ .
We can recover several standard norms on random variables, including some we have already
implicitly used. The first are the classical Lp norms, where we take ψ(t) = t^p, so that

inf{t > 0 | E[|X|^p/t^p] ≤ 1} = E[|X|^p]^{1/p}.

We also have what we term the sub-Gaussian and sub-Exponential norms, which we denote by
considering the functions
ψp (x) := exp (|x|p ) − 1.
These induce the Orlicz ψp -norms, as for p ≥ 1, these are convex (as they are the composition of the
increasing convex function exp(·) applied to the nonnegative convex function | · |p ). Theorem 4.1.11
shows that we have a sub-Gaussian norm
\[
  \|X\|_{\psi_2} := \inf\big\{ t > 0 \mid E[\exp(X^2/t^2)] \le 2 \big\}, \tag{4.1.10}
\]
while Theorem 4.1.15 shows a sub-exponential norm (or Orlicz ψ1-norm)
\[
  \|X\|_{\psi_1} := \inf\big\{ t > 0 \mid E[\exp(|X|/t)] \le 2 \big\}. \tag{4.1.11}
\]
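Because the defining expectation is monotone in t, these norms are easy to approximate numerically; the following sketch (my own, not from the text) bisects on t in the ψ₂ definition (4.1.10), estimating the expectation by Monte Carlo. For X ∼ N(0, 1) one can compute E[exp(Z²/t²)] = (1 − 2/t²)^{−1/2}, so the exact value is √(8/3) ≈ 1.633.

```python
# A numerical sketch (mine) of the psi_2 norm (4.1.10): bisect on t so that
# the Monte Carlo estimate of E[exp(X^2/t^2)] crosses 2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)

def psi2_norm(x, lo=1e-3, hi=100.0, iters=60):
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        # clip the exponent to avoid overflow when t is far too small
        if np.mean(np.exp(np.minimum(x**2 / t**2, 50.0))) > 2:
            lo = t          # expectation exceeds 2: t is too small
        else:
            hi = t
    return 0.5 * (lo + hi)

print(f"estimated ||X||_psi2 = {psi2_norm(x):.3f} "
      f"(exact sqrt(8/3) = {np.sqrt(8/3):.3f})")
```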

Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example,
the definition of the ψp -norms immediately implies that a sub-Gaussian random variable (whether
or not it is mean zero) has a sub-exponential square:

Lemma 4.1.23. A random variable X is sub-Gaussian if and only if X2 is sub-exponential, as
\[
  \|X\|_{\psi_2}^2 = \|X^2\|_{\psi_1}.
\]
Additionally,
\[
  \|XY\|_{\psi_1} \le \|X\|_{\psi_2}\|Y\|_{\psi_2}.
\]


Proof   We prove only the second statement. Because |xy| \le \frac{x^2}{2\eta} + \frac{\eta y^2}{2} for any x, y, and any η > 0,
for any t > 0 we have
\[
  E[\exp(|XY|/t)] \le E\Big[ \exp\Big( \frac{X^2}{2\eta t} + \frac{\eta Y^2}{2t} \Big) \Big] \le E[\exp(X^2/\eta t)]^{1/2}\,E[\exp(\eta Y^2/t)]^{1/2}
\]
by the Cauchy-Schwarz inequality. In particular, if we take t = \|X\|_{\psi_2}\|Y\|_{\psi_2}, then the choice η =
\|X\|_{\psi_2}/\|Y\|_{\psi_2} gives E[\exp(X^2/\eta t)] \le 2 and E[\exp(\eta Y^2/t)] \le 2, so that E[\exp(|XY|/t)] \le 2.

By tracing through the arguments in the proofs of Theorems 4.1.11 and 4.1.15, we can also see that
we have the equivalences
\[
  \|X\|_{\psi_2} \asymp \sup_{k\in\mathbb{N}} \frac{1}{\sqrt{k}} E[|X|^k]^{1/k} \quad \text{and} \quad \|X\|_{\psi_1} \asymp \sup_{k\in\mathbb{N}} \frac{1}{k} E[|X|^k]^{1/k},
\]
where ≍ denotes upper and lower bounds by numerical constants.


The arguments we use to prove Theorems 4.1.11 and 4.1.15 also show the following result, which
gives explicit constants connecting sub-exponential behavior with the ψ1 -norm.

Corollary 4.1.24. Let X be any random variable with ∥X∥ψ1 < ∞. Then for all t ≥ 0,
\[
  P(|X| \ge t) \le 2\exp\big( -t/\|X\|_{\psi_1} \big),
\]
and if E[X] = 0, then X is (8\|X\|_{\psi_1}^2, 2\|X\|_{\psi_1})-sub-exponential.

Proof   The first statement is nearly trivial: we have by the Chernoff bounding method that
\[
  P(|X| \ge t) \le E\big[ \exp\big( |X|/\|X\|_{\psi_1} \big) \big] \exp\big( -t/\|X\|_{\psi_1} \big) \le 2\exp\big( -t/\|X\|_{\psi_1} \big)
\]
by definition of the ψ1-norm. For the second, we mimic the proof of Theorem 4.1.15: because
E[Z] = \int_0^\infty P(Z \ge t)\,dt for Z ≥ 0, we have
\[
  \frac{E[|X|^k]}{\|X\|_{\psi_1}^k} \le \int_0^\infty P\big( |X|/\|X\|_{\psi_1} \ge t^{1/k} \big)\,dt
  = k\int_0^\infty P\big( |X|/\|X\|_{\psi_1} \ge u \big) u^{k-1}\,du \le 2k\int_0^\infty u^{k-1}e^{-u}\,du
\]
using the substitution u^k = t. Rearranging yields E[|X|^k] \le 2\|X\|_{\psi_1}^k \Gamma(k+1) = 2\|X\|_{\psi_1}^k k!. Then
computing the moment generating function, we obtain
\[
  E\big[ \exp(\lambda X/\|X\|_{\psi_1}) \big] \le 1 + \sum_{k=2}^\infty \frac{\lambda^k E[|X|^k]}{\|X\|_{\psi_1}^k k!} \le 1 + 2\sum_{k=2}^\infty |\lambda|^k = 1 + \frac{2\lambda^2}{1 - |\lambda|}
\]
for |λ| < 1. For |λ| ≤ 1/2, we use 1 + x ≤ e^x to obtain E[\exp(\lambda X/\|X\|_{\psi_1})] \le \exp(4\lambda^2), which is the
desired result.


4.1.4 First applications of concentration: random projections


In this section, we investigate the use of concentration inequalities in random projections. As
motivation, consider nearest-neighbor (or k-nearest-neighbor) classification schemes. We have a
sequence of data points as pairs (ui , yi ), where the vectors ui ∈ Rd have labels yi ∈ {1, . . . , L},
where L is the number of possible labels. Given a new point u ∈ Rd that we wish to label, we find
the k-nearest neighbors to u in the sample {(ui , yi )}ni=1 , then assign u the majority label of these
k-nearest neighbors (ties are broken randomly). Unfortunately, it can be prohibitively expensive to
store high-dimensional vectors and search over large datasets to find near vectors; this has motivated
a line of work in computer science on fast methods for nearest neighbors based on reducing the
dimension while preserving essential aspects of the dataset. This line of research begins with Indyk
and Motwani [117], and continuing through a variety of other works, including Indyk [116] and
work on locality-sensitive hashing by Andoni et al. [7], among others. The original approach is due
to Johnson and Lindenstrauss, who used the results in the study of Banach spaces [123]; our proof
follows a standard argument.
The most specific variant of this problem is as follows: we have n points u1, . . . , un, and we
would like to construct a mapping Φ : Rd → Rm, where m ≪ d, such that

∥Φui − Φuj∥2 ∈ (1 ± ϵ)∥ui − uj∥2.

Depending on the norm chosen, this task may be impossible; for the Euclidean (ℓ2) norm, however,
such an embedding is easy to construct using Gaussian random variables and with m = O(\frac{1}{\epsilon^2}\log n).
This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size m is
independent of the dimension d, only depending on the number of points n.

Example 4.1.25 (Johnson-Lindenstrauss): Let the matrix Φ ∈ Rm×d be defined as follows:
\[
  \Phi_{ij} \stackrel{\rm iid}{\sim} N(0, 1/m),
\]
and let Φi ∈ Rd denote the ith row of this matrix. We claim that
\[
  m \ge \frac{8}{\epsilon^2}\Big( 2\log n + \log\frac{1}{\delta} \Big) \quad \text{implies} \quad \|\Phi u_i - \Phi u_j\|_2^2 \in (1 \pm \epsilon)\|u_i - u_j\|_2^2
\]
for all pairs ui, uj with probability at least 1 − δ. In particular, m ≳ \frac{\log n}{\epsilon^2} is sufficient to achieve
accurate dimension reduction with high probability.
To see this, note that for any fixed vector u,
\[
  \frac{\langle\Phi_i, u\rangle}{\|u\|_2} \sim N(0, 1/m), \quad \text{and} \quad \frac{\|\Phi u\|_2^2}{\|u\|_2^2} = \sum_{i=1}^m \langle\Phi_i, u/\|u\|_2\rangle^2
\]
is a sum of independent scaled χ2-random variables. In particular, we have E[\|\Phi u/\|u\|_2\|_2^2] = 1,
and using the χ2-concentration result of Example 4.1.13 yields
\[
  P\big( \big| \|\Phi u\|_2^2/\|u\|_2^2 - 1 \big| \ge \epsilon \big) = P\big( m\big| \|\Phi u\|_2^2/\|u\|_2^2 - 1 \big| \ge m\epsilon \big)
  \le 2\inf_{|\lambda|\le\frac{1}{4}} \exp\big( 2m\lambda^2 - \lambda m\epsilon \big) = 2\exp\Big( -\frac{m\epsilon^2}{8} \Big),
\]


the last inequality holding for ϵ ∈ [0, 1]. Now, using the union bound applied to each of the
pairs (ui, uj) in the sample, we have
\[
  P\Big( \text{there exist } i \ne j \text{ s.t. } \big| \|\Phi(u_i - u_j)\|_2^2 - \|u_i - u_j\|_2^2 \big| \ge \epsilon\|u_i - u_j\|_2^2 \Big)
  \le 2\binom{n}{2}\exp\Big( -\frac{m\epsilon^2}{8} \Big).
\]
Taking m \ge \frac{8}{\epsilon^2}\log\frac{n^2}{\delta} = \frac{16}{\epsilon^2}\log n + \frac{8}{\epsilon^2}\log\frac{1}{\delta} yields that with probability at least 1 − δ, we have
∥Φui − Φuj∥22 ∈ (1 ± ϵ)∥ui − uj∥22. 3
∥Φui − Φuj ∥22 ∈ (1 ± ϵ) ∥ui − uj ∥22 . 3
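A direct implementation sketch of Example 4.1.25 (my own; the choices of n, d, ϵ, and δ are illustrative) draws the Gaussian projection and checks the worst pairwise squared-distance distortion.

```python
# A sketch (mine) of the Johnson-Lindenstrauss embedding: Gaussian random
# projection of n points from R^d to R^m, with a pairwise distortion check.
import numpy as np

rng = np.random.default_rng(3)
n, d, eps, delta = 50, 10000, 0.25, 0.1
m = int(np.ceil(8 / eps**2 * (2 * np.log(n) + np.log(1 / delta))))

U = rng.normal(size=(n, d))                      # arbitrary data points
Phi = rng.normal(0, 1 / np.sqrt(m), size=(m, d)) # Phi_ij ~ N(0, 1/m)
V = U @ Phi.T                                    # embedded points in R^m

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.sum((V[i] - V[j])**2) / np.sum((U[i] - U[j])**2)
        worst = max(worst, abs(ratio - 1))
print(f"m = {m}, worst squared-distance distortion = {worst:.3f} (target {eps})")
```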
Computing low-dimensional embeddings of high-dimensional data is an area of active research,
and more recent work has shown how to achieve sharper constants [63] and how to use more struc-
tured matrices to allow substantially faster computation of the embeddings Φu (see, for example,
Achlioptas [2] for early work in this direction, and Ailon and Chazelle [5] for the so-called “Fast
Johnson-Lindenstrauss transform”).

4.1.5 A second application of concentration: codebook generation


We now consider a (very simplified and essentially un-implementable) view of encoding a signal for
transmission and generation of a codebook for transmitting said signal. Suppose that we have a set
of words, or signals, that we wish to transmit; let us index them by i ∈ {1, . . . , m}, so that there are
m total signals we wish to communicate across a binary symmetric channel Q, meaning that given
an input bit x ∈ {0, 1}, Q outputs a z ∈ {0, 1} with Q(Z = x | x) = 1 − ϵ and Q(Z = 1 − x | x) = ϵ,
for some ϵ < 1/2. (For simplicity, we assume Q is memoryless, meaning that when the channel is
used multiple times on a sequence x1 , . . . , xn , its outputs Z1 , . . . , Zn are conditionally independent:
Q(Z1:n = z1:n | x1:n ) = Q(Z1 = z1 | x1 ) · · · Q(Zn = zn | xn ).)
We consider a simplified block coding scheme, where for each i we associate a codeword
xi ∈ {0, 1}d, where d is a dimension (block length) to be chosen. Upon sending the codeword over
the channel, and receiving some z^{rec} ∈ {0, 1}d, we decode by choosing
\[
  i^* \in \mathop{\rm argmax}_{i\in[m]} Q(Z = z^{\rm rec} \mid x_i) = \mathop{\rm argmin}_{i\in[m]} \|z^{\rm rec} - x_i\|_1, \tag{4.1.12}
\]

the maximum likelihood decoder. We now investigate how to choose a collection {x1 , . . . , xm }
of such codewords and give finite sample bounds on its probability of error. In fact, by using
concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension
is likely to enjoy good performance.
Intuitively, if our codebook {x1 , . . . , xm } ⊂ {0, 1}d is well-separated, meaning that each pair of
words xi , xk satisfies ∥xi − xk ∥1 ≥ cd for some numerical constant c > 0, we should be unlikely to
make a mistake. Let us make this precise. We mistake word i for word k only if the received signal
Z satisfies ∥Z − xi ∥1 ≥ ∥Z − xk ∥1 , and letting J = {j ∈ [d] : xij ̸= xkj } denote the set of at least
c · d indices where xi and xk differ, we have
\[
  \|Z - x_i\|_1 \ge \|Z - x_k\|_1 \quad\text{if and only if}\quad \sum_{j\in J}\big( |Z_j - x_{ij}| - |Z_j - x_{kj}| \big) \ge 0.
\]

If xi is the word being sent and xi and xk differ in position j, then |Zj − xij | − |Zj − xkj | ∈ {−1, 1},
and is equal to −1 with probability (1 − ϵ) and 1 with probability ϵ. That is, we have ∥Z − xi ∥1 ≥
∥Z − xk ∥1 if and only if
\[
  \sum_{j\in J}\big( |Z_j - x_{ij}| - |Z_j - x_{kj}| \big) + |J|(1 - 2\epsilon) \ge |J|(1 - 2\epsilon) \ge cd(1 - 2\epsilon),
\]


and the expectation E_Q[\,|Z_j - x_{ij}| - |Z_j - x_{kj}| \mid x_i\,] = -(1 - 2\epsilon) when x_{ij} \ne x_{kj}. Using the Hoeffding
bound, then, we have
\[
  Q\big( \|Z - x_i\|_1 \ge \|Z - x_k\|_1 \mid x_i \big) \le \exp\Big( -\frac{|J|(1 - 2\epsilon)^2}{2} \Big) \le \exp\Big( -\frac{cd(1 - 2\epsilon)^2}{2} \Big),
\]
where we have used that there are at least |J| ≥ cd indices differing between xi and xk. The
probability of making a mistake at all is thus at most m\exp(-\frac{1}{2}cd(1 - 2\epsilon)^2) if our codebook has
separation c · d.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose
a set of code words {x1 , . . . , xm } that is well separated. To that end, we state a simple lemma.

Lemma 4.1.26. Let Xi, i = 1, . . . , m be drawn independently and uniformly on the d-dimensional
hypercube Hd := {0, 1}d. Then for any t ≥ 0,
\[
  P\Big( \exists\, i, j \text{ s.t. } \|X_i - X_j\|_1 < \frac{d}{2} - dt \Big) \le \binom{m}{2}\exp\big( -2dt^2 \big) \le \frac{m^2}{2}\exp\big( -2dt^2 \big).
\]

Proof   First, let us consider two independent draws X and X′ uniformly on the hypercube. Let
Z = \sum_{j=1}^d 1\{X_j \ne X_j'\} = d_{\rm ham}(X, X') = \|X - X'\|_1. Then E[Z] = \frac{d}{2}. Moreover, Z is an i.i.d.
sum of Bernoulli(\frac{1}{2}) random variables, so that by our concentration bounds of Corollary 4.1.10, we
have
\[
  P\Big( \|X - X'\|_1 \le \frac{d}{2} - t \Big) \le \exp\Big( -\frac{2t^2}{d} \Big).
\]
Using a union bound gives the remainder of the result.

Rewriting the lemma slightly, we may take δ ∈ (0, 1). Then
\[
  P\bigg( \exists\, i, j \text{ s.t. } \|X_i - X_j\|_1 < \frac{d}{2} - \sqrt{d\log\frac{1}{\delta} + d\log m} \bigg) \le \delta.
\]

As a consequence of this lemma, we see two things:

(i) If m ≤ exp(d/16), or d ≥ 16 log m, then taking δ ↑ 1, there exists at least one codebook
{x1, . . . , xm} of words that are all separated by at least d/4, that is, ∥xi − xj∥1 ≥ d/4 for all
i, j.

(ii) By taking m ≤ exp(d/32), or d ≥ 32 log m, and δ = e^{−d/32}, then with probability at least
1 − e^{−d/32}—exponentially close to 1 in d—a randomly drawn codebook has all its entries separated
by at least ∥xi − xj∥1 ≥ d/4.

Summarizing, we have the following result: choose a codebook of m codewords x1, . . . , xm uniformly
at random from the hypercube Hd = {0, 1}d with
\[
  d \ge \max\bigg\{ 32\log m, \; \frac{8\log\frac{m}{\delta}}{(1 - 2\epsilon)^2} \bigg\}.
\]
Then with probability at least 1 − 1/m over the draw of the codebook, the probability we make a
mistake in transmission of any given symbol i over the channel Q is at most δ.
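A toy end-to-end simulation (my own, with illustrative values of m, ϵ, and δ) draws a random codebook of the prescribed block length, transmits through the binary symmetric channel, and decodes by minimum Hamming distance as in (4.1.12).

```python
# A toy simulation (mine) of the random codebook scheme: uniform codewords in
# {0,1}^d, a binary symmetric channel with flip probability eps, and minimum
# Hamming distance (maximum likelihood) decoding.
import numpy as np

rng = np.random.default_rng(4)
m, eps, delta = 64, 0.1, 1e-3
d = int(max(32 * np.log(m), 8 * np.log(m / delta) / (1 - 2 * eps)**2))

codebook = rng.integers(0, 2, size=(m, d))
errors, trials = 0, 2000
for _ in range(trials):
    i = rng.integers(m)
    z = codebook[i] ^ (rng.random(d) < eps)            # channel flips bits
    i_hat = np.argmin(np.sum(z != codebook, axis=1))   # ML decoding
    errors += (i_hat != i)
print(f"d = {d}, empirical error rate = {errors / trials:.4f} (target {delta})")
```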


4.2 Martingale methods


The next set of tools we consider constitute our first look at argument sbased on stability, that is,
how quantities that do not change very much when a single observation changes should concentrate.
In this case, we would like to understand more general quantities than sample means, developing a
few of the basic cools to understand when functions f (X1 , . . . , Xn ) of independent random variables
Xi concentrate around their expectations. Roughly, we expect that if changing the value of one xi
does not significantly change f (xn1 ) much—it is stable—then it should exhibit good concentration
properties.
To develop the tools to do this, we go through an approach based on martingales, a deep subject
in probability theory. We give a high-level treatment of martingales, taking an approach that does
not require measure-theoretic considerations, providing references at the end of the chapter. We
begin by providing a definition.
Definition 4.3. Let M1, M2, . . . be an R-valued sequence of random variables. They are a martingale if there exists another sequence of random variables {Z1, Z2, . . .} ⊂ Z and a sequence of functions
fn : Z^n → R such that
\[
  E[M_n \mid Z_1^{n-1}] = M_{n-1} \quad \text{and} \quad M_n = f_n(Z_1^n)
\]
for all n ∈ N. We say that the sequence Mn is adapted to {Zn}.
In general, one replaces the sequence Z1, Z2, . . . with a sequence of increasing σ-fields F1, F2, . . ., requiring that Mn be Fn-measurable, but Definition 4.3 is sufficient for our purposes. We also will find it convenient to
study differences of martingales, so that we make the following
Definition 4.4. Let D1, D2, . . . be a sequence of random variables. They form a martingale difference sequence if M_n := \sum_{i=1}^n D_i is a martingale.
Equivalently, there is a sequence of random variables Zn and functions gn : Z n → R such that

E[Dn | Z1n−1 ] = 0 and Dn = gn (Z1n )

for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric
random walk.
Example 4.2.1: Let Dn ∈ {±1} be uniform and independent. Then the Dn form a martingale
difference sequence adapted to themselves (that is, we may take Zn = Dn), and M_n = \sum_{i=1}^n D_i
is a martingale. 3
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
Example 4.2.2 (Doob martingales): Let f : X n → R be an otherwise arbitrary function,
and let X1 , . . . , Xn be arbitrary random variables. The Doob martingale is defined by the
difference sequence
Di := E[f (X1n ) | X1i ] − E[f (X1n ) | X1i−1 ].
By inspection, the Di are functions of X1i , and we have

E[Di | X1i−1 ] = E[E[f (X1n ) | X1i ] | X1i−1 ] − E[f (X1n ) | X1i−1 ]


= E[f (X1n ) | X1i−1 ] − E[f (X1n ) | X1i−1 ] = 0


by the tower property of expectations. Thus, the Di satisfy Definition 4.4 of a martingale
difference sequence, and moreover, we have
\[
  \sum_{i=1}^n D_i = f(X_1^n) - E[f(X_1^n)],
\]

and so the Doob martingale captures exactly the difference between f and its expectation. 3

4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities


With these motivating ideas introduced, we turn to definitions, providing generalizations of our
concentration inequalities for sub-Gaussian sums to sub-Gaussian martingales, which we define.
Definition 4.5. Let {Dn} be a martingale difference sequence adapted to {Zn}. Then Dn is a
σn2-sub-Gaussian martingale difference if
\[
  E[\exp(\lambda D_n) \mid Z_1^{n-1}] \le \exp\Big( \frac{\lambda^2\sigma_n^2}{2} \Big)
\]
for all n and λ ∈ R.
Immediately from the definition, we have the Azuma-Hoeffding inequalities, which generalize
the earlier tensorization identities for sub-Gaussian random variables.
Theorem 4.2.3 (Azuma-Hoeffding). Let {Dn} be a σn2-sub-Gaussian martingale difference sequence. Then M_n = \sum_{i=1}^n D_i is \sum_{i=1}^n \sigma_i^2-sub-Gaussian, and moreover,
\[
  \max\{ P(M_n \ge t), P(M_n \le -t) \} \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2} \Big) \quad \text{for all } t \ge 0.
\]

Proof   The proof is essentially immediate: letting Zn be the sequence to which the Dn are
adapted, we write
\[
  E[\exp(\lambda M_n)] = E\Big[ \prod_{i=1}^n e^{\lambda D_i} \Big]
  = E\Big[ E\Big[ \prod_{i=1}^n e^{\lambda D_i} \;\Big|\; Z_1^{n-1} \Big] \Big]
  = E\Big[ E\Big[ \prod_{i=1}^{n-1} e^{\lambda D_i} \;\Big|\; Z_1^{n-1} \Big] E\big[ e^{\lambda D_n} \mid Z_1^{n-1} \big] \Big]
\]
because D1, . . . , Dn−1 are functions of Z_1^{n-1}. Then we use Definition 4.5, which implies that
E[e^{\lambda D_n} \mid Z_1^{n-1}] \le e^{\lambda^2\sigma_n^2/2}, and we obtain
\[
  E[\exp(\lambda M_n)] \le E\Big[ \prod_{i=1}^{n-1} e^{\lambda D_i} \Big] \exp\Big( \frac{\lambda^2\sigma_n^2}{2} \Big).
\]
Repeating the same argument for n − 1, n − 2, . . . , 1 gives that
\[
  \log E[\exp(\lambda M_n)] \le \frac{\lambda^2}{2}\sum_{i=1}^n \sigma_i^2
\]


as desired.
The second claim is simply an application of Chernoff bounds via Proposition 4.1.8 and that
E[Mn] = 0.

As an immediate corollary, we recover Proposition 4.1.9, as sums of independent random variables form martingales via M_n = \sum_{i=1}^n (X_i - E[X_i]). A second corollary gives what is typically
termed the Azuma inequality:

Corollary 4.2.4. Let Di be a bounded difference martingale difference sequence, meaning that
|Di| ≤ c. Then M_n = \sum_{i=1}^n D_i satisfies
\[
  P(n^{-1/2} M_n \ge t) \vee P(n^{-1/2} M_n \le -t) \le \exp\Big( -\frac{t^2}{2c^2} \Big) \quad \text{for } t \ge 0.
\]

Thus, bounded random walks are (with high probability) within ±O(√n) of their expectations after
n steps.
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [96].

4.2.2 Examples and bounded differences


We now develop several example applications of the Azuma-Hoeffding inequalities (Theorem 4.2.3),
applying them most specifically to functions satisfying certain stability conditions.
We first define the collections of functions we consider.

Definition 4.6 (Bounded differences). Let f : X^n → R for some space X. Then f satisfies
bounded differences with constants ci if for each i ∈ {1, . . . , n}, all x_1^n ∈ X^n, and x_i' ∈ X we have
\[
  |f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n)| \le c_i.
\]

The classical inequality relating bounded differences and concentration is McDiarmid’s inequal-
ity, or the bounded differences inequality.

Proposition 4.2.5 (Bounded differences inequality). Let f : X^n → R satisfy bounded differences
with constants ci, and let Xi be independent random variables. Then f(X_1^n) − E[f(X_1^n)] is \frac{1}{4}\sum_{i=1}^n c_i^2-sub-Gaussian, and
\[
  P\big( f(X_1^n) - E[f(X_1^n)] \ge t \big) \vee P\big( f(X_1^n) - E[f(X_1^n)] \le -t \big) \le \exp\Big( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \Big).
\]

Proof   The basic idea is to show that the Doob martingale (Example 4.2.2) associated with f is
c_i^2/4-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define
D_i = E[f(X_1^n) \mid X_1^i] - E[f(X_1^n) \mid X_1^{i-1}] as before, and note that \sum_{i=1}^n D_i = f(X_1^n) - E[f(X_1^n)]. The
random variables
\[
  L_i := \inf_x E[f(X_1^n) \mid X_1^{i-1}, X_i = x] - E[f(X_1^n) \mid X_1^{i-1}],
  \qquad
  U_i := \sup_x E[f(X_1^n) \mid X_1^{i-1}, X_i = x] - E[f(X_1^n) \mid X_1^{i-1}]
\]


evidently satisfy Li ≤ Di ≤ Ui, and moreover, we have
\[
  U_i - L_i \le \sup_{x_1^{i-1}} \sup_{x, x'} \Big\{ E[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x] - E[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x'] \Big\}
  = \sup_{x_1^{i-1}} \sup_{x, x'} \int \big( f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n) \big)\,dP(x_{i+1}^n) \le c_i,
\]
where we have used the independence of the Xi and Definition 4.6 of bounded differences. Consequently, we have by Hoeffding’s Lemma (Example 4.1.6) that E[e^{\lambda D_i} \mid X_1^{i-1}] \le \exp(\lambda^2 c_i^2/8), that
is, the Doob martingale is c_i^2/4-sub-Gaussian.
The remainder of the proof is simply Theorem 4.2.3.
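A simulation sketch (my own, illustrative parameters) of the bounded differences inequality follows: the statistic f(x) = (1/n)Σᵢ 1{xᵢ ≤ 1/2} moves by at most cᵢ = 1/n when one coordinate changes, so Proposition 4.2.5 gives tails exp(−2nt²).

```python
# A simulation sketch (mine) of the bounded differences inequality for
# f(x) = (1/n) sum_i 1{x_i <= 1/2}, which has c_i = 1/n.
import numpy as np

rng = np.random.default_rng(5)
n, trials, t = 500, 20000, 0.05
f_vals = (rng.uniform(0, 1, size=(trials, n)) <= 0.5).mean(axis=1)
empirical = np.mean(f_vals - 0.5 >= t)          # E[f] = 1/2 here
bound = np.exp(-2 * n * t**2)
print(f"empirical tail={empirical:.4e}  bounded-differences bound={bound:.4e}")
```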

A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we will revisit them more later.
Example 4.2.6 (Bounded random vectors): Let B be a Banach space—a complete normed
vector space—with norm ∥·∥. Let Xi be independent bounded random vectors in B satisfying
E[Xi] = 0 and ∥Xi∥ ≤ c. We claim that the quantity
\[
  f(X_1^n) := \Big\| \frac{1}{n}\sum_{i=1}^n X_i \Big\|
\]
satisfies bounded differences. Indeed, we have by the triangle inequality that
\[
  |f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)| \le \frac{1}{n}\|x - x'\| \le \frac{2c}{n}.
\]
Consequently, if the Xi are independent, we have
\[
  P\bigg( \Big| \Big\| \frac{1}{n}\sum_{i=1}^n X_i \Big\| - E\Big[ \Big\| \frac{1}{n}\sum_{i=1}^n X_i \Big\| \Big] \Big| \ge t \bigg) \le 2\exp\Big( -\frac{nt^2}{2c^2} \Big) \tag{4.2.1}
\]
for all t ≥ 0. That is, the norm of (bounded) random vectors in an essentially arbitrary vector
space concentrates extremely quickly about its expectation.
The challenge becomes to control the expectation term in the concentration bound (4.2.1),
which can be difficult. In certain cases—for example, when we have a Euclidean
structure on the vectors Xi—it can be easier. Indeed, let us specialize to the case that Xi ∈ H,
a (real) Hilbert space, so that there is an inner product ⟨·, ·⟩ and the norm satisfies ∥x∥2 = ⟨x, x⟩
for x ∈ H. Then Cauchy-Schwarz implies that
\[
  E\Big[ \Big\| \sum_{i=1}^n X_i \Big\| \Big]^2 \le E\Big[ \Big\| \sum_{i=1}^n X_i \Big\|^2 \Big] = \sum_{i,j} E[\langle X_i, X_j\rangle] = \sum_{i=1}^n E[\|X_i\|^2].
\]

That is, assuming the Xi are independent and E[∥Xi∥2] ≤ σ2, inequality (4.2.1) implies
\[
  P\Big( \big\| \overline{X}_n \big\| \ge \frac{\sigma}{\sqrt{n}} + t \Big) \le 2\exp\Big( -\frac{nt^2}{2c^2} \Big),
\]
where \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, since E[\|\overline{X}_n\|] \le \sigma/\sqrt{n}. 3


We can specialize Example 4.2.6 to a situation that is very important for treatments of concentration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of Example 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let
X be some space, and let F be some collection of functions f : X → R. Let εi ∈ {−1, 1} be a
collection of independent random signs. Then the empirical Rademacher complexity of
F is
\[
  R_n(\mathcal{F} \mid x_1^n) := E\Big[ \sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(x_i) \Big],
\]
where the expectation is over only the random signs εi. (In some cases, depending on context
and convenience, one takes the absolute value |\sum_i \varepsilon_i f(x_i)|.) The Rademacher complexity of
F is
\[
  R_n(\mathcal{F}) := E[R_n(\mathcal{F} \mid X_1^n)],
\]
the expectation of the empirical Rademacher complexities.
If f : X → [b0, b1] for all f ∈ F, then the empirical Rademacher complexity satisfies bounded
differences, because for any two sequences x_1^n and z_1^n differing in only element j, we have
\[
  n|R_n(\mathcal{F} \mid x_1^n) - R_n(\mathcal{F} \mid z_1^n)| \le E\Big[ \sup_{f\in\mathcal{F}} \sum_{i=1}^n \varepsilon_i(f(x_i) - f(z_i)) \Big] = E\big[ \sup_{f\in\mathcal{F}} \varepsilon_j(f(x_j) - f(z_j)) \big] \le b_1 - b_0.
\]
Consequently, the empirical Rademacher complexity satisfies that R_n(\mathcal{F} \mid X_1^n) - R_n(\mathcal{F}) is \frac{(b_1-b_0)^2}{4n}-sub-Gaussian by Theorem 4.2.3. 3
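For concreteness, here is a small Monte Carlo sketch (mine, not from the text) of the empirical Rademacher complexity of a simple finite class of threshold functions; the class, sample, and grid are all illustrative choices.

```python
# A Monte Carlo sketch (mine) of the empirical Rademacher complexity of
# F = {x -> sign(x - c) : c in a grid}, evaluated on a fixed sample.
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(-1, 1, n)
thresholds = np.linspace(-1, 1, 41)
F = np.sign(x[None, :] - thresholds[:, None])   # |F| x n matrix of f(x_i)

draws = 2000
vals = np.empty(draws)
for k in range(draws):
    eps = rng.choice([-1.0, 1.0], size=n)
    vals[k] = np.max(F @ eps) / n               # sup_f (1/n) sum_i eps_i f(x_i)
print(f"estimated R_n(F | x) = {vals.mean():.4f}")
```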
These examples warrant more discussion, and it is possible to argue that many variants of these
random variables are well-concentrated. For example, instead of functions we may simply consider
an arbitrary set A ⊂ Rn and define the random variable
\[
  Z(A) := \sup_{a\in A}\langle a, \varepsilon\rangle = \sup_{a\in A}\sum_{i=1}^n a_i\varepsilon_i.
\]
As a function of the random signs εi, we may write Z(A) = f(ε), and this is then a function
satisfying |f(ε) − f(ε′)| ≤ \sup_{a\in A}|\langle a, \varepsilon - \varepsilon'\rangle|, so that if ε and ε′ differ only in index i, we have
|f(ε) − f(ε′)| ≤ 2\sup_{a\in A}|a_i|. That is, Z(A) − E[Z(A)] is \sum_{i=1}^n \sup_{a\in A} a_i^2-sub-Gaussian.

Example 4.2.8 (Rademacher complexity as a random vector): This view of Rademacher
complexity shows how we may think of Rademacher complexities as norms on certain spaces.
Indeed, if we consider a vector space L of linear functions on F, then we can define the F-seminorm on L by \|L\|_{\mathcal{F}} := \sup_{f\in\mathcal{F}}|L(f)|. In this case, we may consider the symmetrized
empirical distributions
\[
  P_n^0 := \frac{1}{n}\sum_{i=1}^n \varepsilon_i 1_{X_i}, \qquad f \mapsto P_n^0 f := \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)
\]
as elements of this vector space L. (Here we have used 1Xi to denote the point mass at Xi.)
Then the Rademacher complexity is nothing more than the expected norm of P_n^0, a random
vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general
results we may prove about random vectors, as in Example 4.2.6, will carry over immediately
to versions of the Rademacher complexity. 3


4.3 Matrix concentration


In this section, we will develop analogues of the concentration inequalities for sums in Section 4.1,
including matrix Hoeffding and Bernstein inequalities. Our main goal will be to bound maximal
eigenvalues (or operator norms) of symmetric and Hermitian matrices, that is, for sums S_n = \sum_{i=1}^n X_i of independent matrices, to bound deviation probabilities
\[
  P(\lambda_{\max}(S_n) \ge t) \quad \text{or} \quad P(\lambda_{\min}(S_n) \le -t),
\]
where λmax and λmin denote maximal and minimal eigenvalues, respectively. Our approach will
be to generalize the approach using moment generating functions, though this becomes non-trivial
because there is no immediately obvious analogue of the tensorization identities we have for scalars.
While in the scalar case, for a sum S_n = \sum_{i=1}^n X_i of independent random variables, we have
\[
  e^{\lambda S_n} = \prod_{i=1}^n e^{\lambda X_i},
\]
such an identity fails for matrices, because their exponentials (typically) fail to commute.
To develop the basic matrix concentration inequalities we provide, we require a brief review of
matrix calculus and operator functions. We shall typically work with Hermitian matrices A ∈ Cd×d,
meaning that A = A∗, where A∗ denotes the Hermitian transpose of A, whose entries are (A^*)_{ij} = \overline{A_{ji}},
the conjugate of A_{ji}. We work in this generality for two reasons: first, because such matrices
admit the spectral decompositions we require to develop the operators we use, and second, because
we often will encounter random matrices with symmetric distributions, meaning that X \stackrel{\rm dist}{=} -X,
which can lead to confusion.
With this, we give a brief review of some properties of Hermitian matrices and some associated
matrix operators. Let Hd := {A ∈ Cd×d | A∗ = A} be the Hermitian matrices. The spectral
theorem gives that any A ∈ Hd admits the spectral decomposition A = UΛU∗, where Λ is
the diagonal matrix of the (necessarily) real eigenvalues of A and U ∈ Cd×d is unitary, so that
U∗U = UU∗ = I. For a function f : R → R, we can then define its operator extension to Hd by
\[
  f(A) := U \,{\rm diag}\big( f(\lambda_1(A)), \ldots, f(\lambda_d(A)) \big)\, U^*,
\]
where A has spectral decomposition A = U ΛU ∗ and λi (A) denotes the ith eigenvalue of A. Because
we wish to mimic the approach based on moment generating functions that yields our original sub-
Gaussian and sub-Exponential concentration inequalities in Chapter 4, the most important function
for us will be the exponential, which evidently satisfies

\[
  \exp(A) = \sum_{k=0}^\infty \frac{1}{k!} A^k,
\]

where we recall the convention that A0 = I whenever A is Hermitian.


A Hermitian matrix A is positive definite, denoted A ≻ 0, if x∗Ax > 0 for all x ̸= 0, and is
positive semidefinite (PSD), which we denote by A ⪰ 0, if x∗Ax ≥ 0 for all vectors x. Positive
definiteness is then equivalent to the condition that λi(A) > 0 for all eigenvalues of A, while
semidefiniteness is equivalent to λi(A) ≥ 0. We also use the standard semidefinite ordering, so that A ⪰ B
means that A − B ⪰ 0. For A ∈ Hd, we evidently have exp(A) ≻ 0. The familiar trace
tr(A) = \sum_{j=1}^d A_{jj} of a square matrix allows us to define inner products, where for general complex matrices


A, B ∈ Cm×n we define ⟨A, B⟩ = tr(A∗B), while the space of Hermitian matrices admits the
real inner product ⟨A, B⟩ := tr(A∗B). (See Exercise 4.14.) The spectral theorem also shows the
standard identity that tr(A) = \sum_{j=1}^d \lambda_j(A) for A ∈ Hd.
To analogize our approach with real-valued random variables, we begin with the Chernoff bound,
Proposition 4.1.3. Here, we have the following observation:

Proposition 4.3.1. For any random Hermitian matrix X,

P(λmax (X) ≥ t) ≤ tr(E[eλX ])e−λt

for all λ ≥ 0 and t ≥ 0.

Proof   First, apply the standard Chernoff bound to the random variable λmax(X), which gives
for any λ > 0 that
P(λmax (X) ≥ t) ≤ E[eλ·λmax (X) ]e−λt .
Then observe that by definition of the matrix exponential, we have eλ·λmax (X) ≤ tr(eλX ), because
the eigenvalues of eλX are all positive.

We would like now to provide some type of general tensorization identity for matrices, in analogy
with Propositions 4.1.9 or 4.1.17. Unfortunately, this breaks down: for Hermitian A, B, we have

eA+B = eA eB

if and only if A and B commute [153], so that they are simultaneously diagonalizable. Nonetheless,
we have the following inequality, which will be the key to extending the standard one-dimensional
approach to concentration:

Proposition 4.3.2 (The Golden-Thompson inequality). Let A, B be Hermitian matrices. Then

tr(eA+B ) ≤ tr(eA eB ).

While the proof is essentially elementary, it is not central to our development, so we defer it to
Section 4.4.4. We remark in passing that there is a converse [153, Section 3]: tr(eA+B ) = tr(eA eB ) if
and only if AB = BA, that is, A and B are simultaneously diagonalizable. With Proposition 4.3.2 in
hand, however, we can develop matrix analogues of the Hoeffding and Bernstein-type concentration
bounds in Chapter 4.
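Though not needed for the development, one can sanity-check the Golden-Thompson inequality numerically; the sketch below (ours, assuming numpy and scipy) verifies tr(e^{A+B}) ≤ tr(e^A e^B) on random Hermitian pairs:

```python
# Numerical sanity check of Golden-Thompson: tr(e^{A+B}) <= tr(e^A e^B).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

def random_hermitian(d):
    B = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (B + B.conj().T) / 2

for _ in range(1000):
    A, B = random_hermitian(5), random_hermitian(5)
    lhs = np.trace(expm(A + B)).real
    rhs = np.trace(expm(A) @ expm(B)).real
    assert lhs <= rhs + 1e-8   # equality holds iff A and B commute
```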
We begin with Azuma-Hoeffding-type bounds, which analogize Theorem 4.2.3. The key to allowing
an iterative “peeling off” of individual terms in a sum of random matrices is the following result:

Lemma 4.3.3 (A matrix symmetrization inequality). Let H be an arbitrary (fixed) Hermitian
matrix, X be a mean-zero Hermitian matrix, and ε ∈ {±1} be a uniform random sign independent
of X. Then

    tr(E[e^{H+X}]) ≤ tr(E[e^{H+2εX}]).

Proof  Let X′ be an independent copy of X. Then because the trace exponential tr(e^X) is convex
on the Hermitian matrices (see Exercise 4.16), we have

    tr(E[e^{H+X}]) = tr(E[e^{H+(X−E[X′])}]) ≤ tr(E[e^{H+X−X′}])

by Jensen's inequality. Introducing the random sign ε ∈ {±1}, we have by symmetry that X − X′
and ε(X − X′) share the same distribution, and so

    tr(E[e^{H+X}]) ≤ E[tr(e^{H+εX−εX′})] = E[tr(e^{H/2+εX+H/2−εX′})]
                  ≤ E[tr(e^{H/2+εX} e^{H/2−εX′})],

where the second inequality follows from Proposition 4.3.2. Now we use that for Hermitian matrices
A, B, we have tr(AB) ≤ ∥A∥2 ∥B∥2 = tr(A²)^{1/2} tr(B²)^{1/2}, so that

    E[tr(e^{H/2+εX} e^{H/2−εX′})] ≤ E[tr(e^{H+2εX})^{1/2} tr(e^{H−2εX′})^{1/2}]
                                  ≤ E[tr(e^{H+2εX})]^{1/2} E[tr(e^{H−2εX′})]^{1/2}

by Cauchy-Schwarz. Because εX and −εX′ have the same distribution, the lemma follows.

This allows us to perform the type of “peeling-off” argument, addressing one term in the sum
at a time, that gives tight enough moment generating function bounds.

Theorem 4.3.4. Let X1, . . . , Xn ∈ Hd be independent, mean-zero, and satisfy ∥Xi∥op ≤ bi. Define
Sn = ∑_{i=1}^n Xi. Then for all λ ∈ R,

    E[tr(e^{λSn})] ≤ d exp(2λ² ∑_{i=1}^n bi²).

Proof  By iterated expectation and Lemma 4.3.3, we have

    tr(E[e^{λSn}]) ≤ tr(E[e^{λS_{n−1}+2λεXn}]) ≤ tr(E[e^{λS_{n−1}}]) · ∥E[e^{2λεXn}]∥op,

where ε ∈ {±1} is an independent random sign and we have used independence. Now, we use the
following calculation: if X is Hermitian and ε a random sign, then

    E[e^{εX}] ⪯ E[e^{X²/2}].    (4.3.1)

Temporarily deferring the argument for inequality (4.3.1), note that it immediately implies E[e^{2λεX}] ⪯
E[e^{2λ²X²}]. The convexity of the operator norm and that ∥Xn∥op ≤ bn then imply

    ∥E[e^{2λεXn}]∥op ≤ ∥E[e^{2λ²Xn²}]∥op ≤ E[e^{2λ²∥Xn²∥op}] ≤ e^{2λ²bn²}.

Repeating the argument by iteratively peeling off the last remaining term in S_{n−1} through S_1 then
yields

    E[tr(exp(λSn))] ≤ tr(I) ∏_{i=1}^n exp(2λ²bi²),

which gives the theorem.
To see inequality (4.3.1), note that for any positive semidefinite A, we have A ⪯ tA for t ≥ 1.
Then because X^{2k} ⪰ 0 for all k ∈ N and (2k)! ≥ 2^k k!, we have

    E[e^{εX}] = I + ∑_{k=1}^∞ E[X^{2k}]/(2k)! ⪯ I + ∑_{k=1}^∞ E[(X²)^k]/(2^k k!) = E[e^{X²/2}],

where we used the symmetry of ε to eliminate terms with odd powers.

Theorem 4.3.4 immediately implies the following corollary, whose argument parallels those in
Chapter 4 (e.g., Corollary 4.1.10).

Corollary 4.3.5. Let Xi ∈ Hd be independent mean-zero Hermitian matrices with ∥Xi∥op ≤ bi.
Then Sn := ∑_{i=1}^n Xi satisfies

    P(∥Sn∥op ≥ t) ≤ 2d exp(−t² / (8 ∑_{i=1}^n bi²)).
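The following Monte Carlo sketch (our own, assuming numpy) illustrates Corollary 4.3.5 on sums of randomly signed Hermitian matrices with ∥Xi∥op ≤ 1; the bound is conservative in its constants, but the comparison shows the claimed behavior:

```python
# Monte Carlo illustration of the matrix Hoeffding bound (Corollary 4.3.5).
# We take X_i = eps_i * B_i for random signs eps_i and fixed Hermitian B_i with
# ||B_i||_op = 1, so each X_i is mean zero with ||X_i||_op <= b_i = 1.
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 10, 200, 2000

Bs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    B = (G + G.T) / 2
    Bs.append(B / np.linalg.norm(B, 2))    # normalize to operator norm 1
Bs = np.array(Bs)

t = 100.0
exceed = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    Sn = np.einsum('i,ijk->jk', eps, Bs)
    exceed += np.linalg.norm(Sn, 2) >= t

bound = 2 * d * np.exp(-t**2 / (8 * n))    # 2d exp(-t^2 / (8 sum b_i^2))
print(f"empirical P(||S_n||_op >= {t}): {exceed / trials:.4f}, bound: {bound:.4f}")
```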
If we have more direct bounds on E[e^{λXi}], then we can also employ those via a similar “peeling
off the last term” argument. By carefully controlling matrix moment generating functions in a way
similar to what we did in Example 4.1.14 to obtain sub-exponential behavior for bounded random
variables, we can give a matrix Bernstein-type inequality.
Theorem 4.3.6. Let Xi be independent mean-zero Hermitian matrices with ∥Xi∥op ≤ b and
∥E[Xi²]∥op ≤ σi². Then Sn = ∑_{i=1}^n Xi satisfies

    P(∥Sn∥op ≥ t) ≤ 2d exp(−min{ t² / (4 ∑_{i=1}^n σi²), 3t/(4b) }).

The proof of the theorem is similar to that of Theorem 4.3.4, so we leave it as an extended exercise
(Exercise 4.17).
We unpack the theorem a bit to give some intuition. Given a variance bound σ² such that
E[Xi²] ⪯ σ²I, the theorem states that

    P(∥n^{−1}Sn∥op ≥ t) ≤ 2d exp(−min{ nt²/(4σ²), 3nt/(4b) }).

Letting δ ∈ (0, 1) be arbitrary and setting t = max{ (2σ/√n)√(log(2d/δ)), (4b/3n) log(2d/δ) }, we have

    ∥(1/n) ∑_{i=1}^n Xi∥op ≤ max{ (2σ/√n)√(log(2d/δ)), (4b/3n) log(2d/δ) }

with probability at least 1 − δ. So we see the familiar sub-Gaussian and sub-exponential scaling of
the random sum.
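As a small how-to, the helper below (hypothetical code of ours, assuming numpy; the function name is our own) computes the deviation level t just derived from (n, d, σ, b, δ):

```python
# Computing the high-probability deviation level for the unpacked matrix
# Bernstein bound above (a sketch; the helper name is ours).
import numpy as np

def bernstein_deviation(n, d, sigma, b, delta):
    """Deviation t with P(||S_n / n||_op >= t) <= delta under Theorem 4.3.6."""
    log_term = np.log(2 * d / delta)
    t_subgaussian = 2 * sigma * np.sqrt(log_term / n)   # variance-dominated regime
    t_subexp = 4 * b * log_term / (3 * n)               # range-dominated regime
    return max(t_subgaussian, t_subexp)

# Example: n = 10^4 samples of 50x50 matrices with sigma = b = 1, delta = 0.01.
print(bernstein_deviation(n=10**4, d=50, sigma=1.0, b=1.0, delta=0.01))
```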

4.4 Technical proofs


4.4.1 Proof of Theorem 4.1.11
(1) implies (2)  Let K1 = 1. Using the change of variables identity that for a nonnegative random
variable Z and any k ≥ 1 we have E[Z^k] = k ∫_0^∞ t^{k−1} P(Z ≥ t) dt, we find

    E[|X|^k] = k ∫_0^∞ t^{k−1} P(|X| ≥ t) dt ≤ 2k ∫_0^∞ t^{k−1} exp(−t²/σ²) dt = kσ^k ∫_0^∞ u^{k/2−1} e^{−u} du,

where for the last equality we made the substitution u = t²/σ². Noting that this final integral is
Γ(k/2), we have E[|X|^k] ≤ kσ^k Γ(k/2). Because Γ(s) ≤ s^s for s ≥ 1, we obtain

    E[|X|^k]^{1/k} ≤ k^{1/k} σ √(k/2) ≤ e^{1/e} σ √k.

Thus (2) holds with K2 = e^{1/e}.


(2) implies (3)  Let σ = sup_{k≥1} k^{−1/2} E[|X|^k]^{1/k}, so that K2 = 1 and E[|X|^k] ≤ k^{k/2} σ^k for all k. For
K3 ∈ R+, we thus have

    E[exp(X²/(K3²σ²))] = ∑_{k=0}^∞ E[X^{2k}] / (k! K3^{2k} σ^{2k}) ≤ ∑_{k=0}^∞ σ^{2k}(2k)^k / (k! K3^{2k} σ^{2k}) ≤(i) ∑_{k=0}^∞ (2e/K3²)^k,

where inequality (i) follows because k! ≥ (k/e)^k, or 1/k! ≤ (e/k)^k. Noting that ∑_{k=0}^∞ α^k = 1/(1−α),
we obtain (3) by taking K3 = e√(2/(e−1)) ≈ 2.933.

(3) implies (4)  Let us take K3 = 1 and recall the assumption of (4) that E[X] = 0. We claim
that (4) holds with K4 = 3/4. We prove this result for both small and large λ. First, note the
(non-standard but true!) inequality that e^x ≤ x + e^{9x²/16} for all x. Then we have

    E[exp(λX)] ≤ E[λX] + E[exp(9λ²X²/16)] = E[exp(9λ²X²/16)],

as E[λX] = 0. Now note that for |λ| ≤ 4/(3σ), we have 9λ²σ²/16 ≤ 1, and so by Jensen's inequality,

    E[exp(9λ²X²/16)] = E[exp(X²/σ²)^{9λ²σ²/16}] ≤ e^{9λ²σ²/16}.

For large λ, we use the simpler Fenchel-Young inequality, that is, that λx ≤ λ²/(2c) + cx²/2, valid for all
c > 0. Then we have for any 0 < c ≤ 2 that

    E[exp(λX)] ≤ e^{λ²σ²/(2c)} E[exp(cX²/(2σ²))] ≤ e^{λ²σ²/(2c)} e^{c/2},

where the final inequality follows from Jensen's inequality. If |λ| ≥ 4/(3σ), then c/2 ≤ (9c/32)λ²σ², and we
have

    E[exp(λX)] ≤ inf_{c∈(0,2]} exp([1/(2c) + 9c/32] λ²σ²) = exp(3λ²σ²/4).

(3) implies (1)  Assume (3) holds with K3 = 1. Then for t ≥ 0 we have

    P(|X| ≥ t) = P(X²/σ² ≥ t²/σ²) ≤ E[exp(λX²/σ²)] exp(−λt²/σ²)

for all λ ≥ 0. For λ ≤ 1, Jensen's inequality implies E[exp(λX²/σ²)] ≤ E[exp(X²/σ²)]^λ ≤ e^λ by
assumption (3). Set λ = log 2 ≈ .693.

(4) implies (1)  This is the content of Proposition 4.1.8, with K4 = 1/2 and K1 = 2.

4.4.2 Proof of Theorem 4.1.15


(1) implies (2)  As in the proof of Theorem 4.1.11, we use that for a nonnegative random variable
Z we have E[Z^k] = k ∫_0^∞ t^{k−1} P(Z ≥ t) dt. Let K1 = 1. Then

    E[|X|^k] = k ∫_0^∞ t^{k−1} P(|X| ≥ t) dt ≤ 2k ∫_0^∞ t^{k−1} exp(−t/σ) dt = 2kσ^k ∫_0^∞ u^{k−1} exp(−u) du,

where we used the substitution u = t/σ. Thus we have E[|X|^k] ≤ 2Γ(k+1)σ^k, and using Γ(k+1) ≤
k^k yields E[|X|^k]^{1/k} ≤ 2^{1/k} kσ, so that (2) holds with K2 ≤ 2.


(2) implies (3)  Let K2 = 1, and note that

    E[exp(X/(K3σ))] = ∑_{k=0}^∞ E[X^k] / (K3^k σ^k k!) ≤ ∑_{k=0}^∞ (k^k/k!) · (1/K3^k) ≤(i) ∑_{k=0}^∞ (e/K3)^k,

where inequality (i) used that k! ≥ (k/e)^k. Taking K3 = e²/(e−1) < 5 gives the result.

(3) implies (1)  If E[exp(X/σ)] ≤ e, then for t ≥ 0

    P(X ≥ t) ≤ E[exp(X/σ)] e^{−t/σ} ≤ e^{1−t/σ}.

With the same result for the negative tail, we have

    P(|X| ≥ t) ≤ (2e^{1−t/σ}) ∧ 1 ≤ 2e^{−2t/(5σ)},

so that (1) holds with K1 = 5/2.

(2) if and only if (4)  Assume that (2) holds with K2 = 1, and let σ = sup_{k≥1} k^{−1} E[|X|^k]^{1/k}. Then
because E[X] = 0,

    E[exp(λX)] ≤ 1 + ∑_{k=2}^∞ |λ|^k E[|X|^k] / k! ≤ 1 + ∑_{k=2}^∞ (k|λ|σ)^k / k! ≤ 1 + ∑_{k=2}^∞ (e|λ|σ)^k,

where we have used that k! ≥ (k/e)^k. When |λ| < 1/(eσ), evaluating the geometric series yields

    E[exp(λX)] ≤ 1 + (eλσ)² / (1 − e|λ|σ).

For |λ| ≤ 1/(2eσ), we obtain E[e^{λX}] ≤ 1 + 2e²σ²λ², and as 1 + x ≤ e^x this implies (4).
For the opposite direction, assume (4) holds with K4 = K4′ = 1. Then E[exp(λX/σ)] ≤ exp(1)
for λ ∈ [−1, 1], and (3) holds. The preceding parts imply the remainder of the equivalence.

4.4.3 Proof of Theorem 5.1.6


JCD Comment: I would like to write this. For now, check out Ledoux and Talagrand
[135, Theorem 4.12] or Koltchinskii [127, Theorem 2.2].

4.4.4 Proof of Proposition 4.3.2


The key insight is to rewrite the matrix exponential e^{A+B} as a limit of products of matrices, then work
more directly with traces of powers. To that end, we shall use the Lie product formula

    lim_{n→∞} (exp(A/n) exp(B/n))^n = exp(A + B).    (4.4.1)

We leave the proof of the equality (4.4.1) as Exercise 4.15. Using it, it evidently suffices
to prove that there exists some sequence of integers n → ∞ such that along this sequence,

    tr((e^{A/n} e^{B/n})^n) ≤ tr(e^A e^B).    (4.4.2)


Now recall that the Schatten p-norm of a matrix A is ∥A∥p := tr((AA*)^{p/2})^{1/p} = ∥γ(A)∥p,
the ℓp-norm of its singular values γ(A), where p = 2 gives the Euclidean or Frobenius norm
∥A∥2 = (∑_{i,j} |Aij|²)^{1/2}. This norm gives a generalized Hölder-type inequality for powers of 2, that is,
n ∈ {2^k}_{k∈N}, which we can in turn use to prove the Golden-Thompson inequality. In particular,
we demonstrate that for n a power of 2,

    |tr(A1 · · · An)| ≤ ∥A1∥n · · · ∥An∥n.    (4.4.3)

To see this inequality, we proceed inductively. Because the trace defines the inner product
⟨A, B⟩ = tr(A∗ B), for n = 2, the Cauchy-Schwarz inequality implies

| tr(A1 A2 )| = |⟨A∗1 , A2 ⟩| ≤ ∥A1 ∥2 ∥A2 ∥2 .

We now perform an induction, where we have demonstrated the base case n = 2. For n ≥ 4
a power of 2, the inductive hypothesis—that inequality (4.4.3) holds for n/2—gives

    |tr(A1 · · · An)| ≤ ∥A1A2∥_{n/2} · · · ∥A_{n−1}An∥_{n/2}.

Now consider an arbitrary pair of matrices A, B. We will demonstrate that ∥AB∥_{n/2} ≤ ∥A∥n ∥B∥n,
which will then evidently imply inequality (4.4.3). For these, we have

    ∥AB∥_{n/2}^{n/2} = tr(ABB*A* · · · ABB*A*) = tr((A*ABB*)^{n/4}),

with n/4 copies of ABB*A*, by the cyclic property of the trace. Using the inductive hypothesis again with n/4 copies of each
of the matrices A*A and BB*, we thus have

    ∥AB∥_{n/2}^{n/2} ≤ ∥A*A∥_{n/2}^{n/4} ∥BB*∥_{n/2}^{n/4} = tr((A*A)^{n/2})^{1/2} tr((BB*)^{n/2})^{1/2} = ∥A∥_n^{n/2} ∥B∥_n^{n/2}.

That is, we have ∥AB∥_{n/2} ≤ ∥A∥n ∥B∥n for any A, B as desired, giving inequality (4.4.3).
We apply inequality (4.4.3) to powers of products of Hermitian matrices A, B. We have

    tr((AB)^n) ≤ ∥AB∥_n^n = tr((ABB*A*)^{n/2}) = tr((A*ABB*)^{n/2}) = tr((A²B²)^{n/2})

because A = A* and B = B*. Recognizing that A² and B² are Hermitian, we repeat this argument
to obtain

    tr((A²B²)^{n/2}) ≤ tr((A⁴B⁴)^{n/4}) ≤ · · · ≤ tr(A^n B^n)

for any n ∈ {2^k}_{k∈N}. Replacing A and B by e^A and e^B, which are both Hermitian, we obtain

    tr((e^A e^B)^n) ≤ tr(e^{nA} e^{nB})  for n ∈ {2^k}_{k∈N}.

This is inequality (4.4.2) once we replace A and B by A/n and B/n.


4.5 Bibliography

A few references on concentration, random matrices, and entropies include Vershynin's extraordinarily
readable lecture notes [187], upon which our proof of Theorem 4.1.11 is based, the comprehensive
book of Boucheron, Lugosi, and Massart [37], and the more advanced material in Buldygin
and Kozachenko [43]. Many of our arguments are based off of those of Vershynin and Boucheron
et al. Kolmogorov and Tikhomirov [126] introduced metric entropy.

We give weaker versions of the matrix-Hoeffding and matrix-Bernstein inequalities; it is possible
to do much better. Ahlswede and Winter developed early matrix concentration inequalities, as did Petz [153].
I took the proof of Golden-Thompson from Terry Tao's blog. Lemma 4.3.3 is [181, Lemma 7.6].
It is possible to obtain better concentration guarantees using Lieb's concavity inequality, that

    f(A) := tr(exp(H + log A))    (4.5.1)

is a concave function in A ≻ 0.

4.6 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let X be a random variable taking
values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding’s Lemma, that is,
that X is sub-Gaussian: for all λ ∈ R, we have

    E[exp(λ(X − E[X]))] ≤ exp(λ²(b − a)²/8).

(a) Show that Var(X) ≤ ((b − a)/2)² = (b − a)²/4 for any random variable X taking values in [a, b].

(b) Let
ϕ(λ) = log E[exp(λ(X − E[X]))].
Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that

    ϕ(0) = 0,  ϕ′(0) = 0,  ϕ′′(t) = E[X²e^{tX}]/E[e^{tX}] − E[Xe^{tX}]²/E[e^{tX}]².
(You may assume that derivatives and expectations commute, which they do in this case.)

(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and

Var(Yt ) = ϕ′′ (t).

(You may assume X has a density for simplicity.)


(d) Using the result of part (c), show that ϕ(λ) ≤ λ²(b − a)²/8 for all λ ∈ R.

Exercise 4.2 (Variance lower bounds on sub-Gaussian parameters):


(a) Let X be σ²-sub-Gaussian, that is, E[exp(λX)] ≤ exp(λ²σ²/2) for all λ ∈ R. Show that E[X] = 0
and E[X²] ≤ σ².


(b) Let X be a random variable with E[X] = 0 and Var(X) = σ² > 0. Show that

    lim inf_{|λ|→∞} (1/|λ|) log E[exp(λX)] > 0.

Exercise 4.3 (Mills ratio): Let ϕ(t) = (1/√(2π)) e^{−t²/2} be the density of a standard Gaussian,
Z ∼ N(0, 1), and Φ(t) = ∫_{−∞}^t ϕ(u) du its cumulative distribution function.

(a) Show that P(Z ≥ t) ≤ (1/t) ϕ(t) for all t > 0.

(b) Define

    g(t) := 1 − Φ(t) − (t/(t² + 1)) ϕ(t).

Show that g(0) = 1/2, that g′(t) < 0 for all t ≥ 0, and that lim_{t→∞} g(t) = 0.

(c) Conclude that for all t ≥ 0,

    (t/(t² + 1)) ϕ(t) ≤ P(Z ≥ t) ≤ (1/t) ϕ(t).

Exercise 4.4 (Likelihood ratio bounds and concentration): Consider a data release problem,
where given a sample x, we release a sequence of data Z1 , Z2 , . . . , Zn belonging to a discrete set Z,
where Zi may depend on Z1i−1 and x. We assume that the data has limited information about x
in the sense that for any two samples x, x′ , we have the likelihood ratio bound

p(zi | x, z1i−1 )
≤ eε .
p(zi | x′ , z1i−1 )

Let us control the amount of “information” (in the form of an updated log-likelihood ratio) released
by this sequential mechanism. Fix x, x′ , and define
    L(z1, . . . , zn) := log [ p(z1, . . . , zn | x) / p(z1, . . . , zn | x′) ].

(a) Show that, assuming the data Zi are drawn conditional on x,

    P(L(Z1, . . . , Zn) ≥ nε(e^ε − 1) + t) ≤ exp(−t²/(2nε²)).

Equivalently, show that

    P(L(Z1, . . . , Zn) ≥ nε(e^ε − 1) + ε√(2n log(1/δ))) ≤ δ.

(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any
test Ψ : Z n → {x, x′ }, we have

Px (Ψ(Z1n ) ̸= x) + Px′ (Ψ(Z1n ) ̸= x′ ) ≥ 1 − γ,

where Px and Px′ denote the sampling distribution of Z1n under x and x′ , respectively?


Exercise 4.5 (Marcinkiewicz-Zygmund inequality): Let Xi be independent random variables
with E[Xi] = 0 and E[|Xi|^p] < ∞, where 1 ≤ p < ∞. Prove that

    E[|∑_{i=1}^n Xi|^p] ≤ Cp E[(∑_{i=1}^n |Xi|²)^{p/2}],

where Cp is a constant depending only on p. As a corollary, derive that if E[|Xi|^p] ≤ σ^p and p ≥ 2,
then

    E[|(1/n) ∑_{i=1}^n Xi|^p] ≤ Cp σ^p / n^{p/2}.
That is, sample means converge quickly to zero in higher moments. Hint: For any fixed x ∈ Rn , if
εi are i.i.d. uniform signs εi ∈ {±1}, then εT x is sub-Gaussian.
Exercise 4.6 (A vector Marcinkiewicz-Zygmund inequality): Let Xi ∈ Rd be independent vectors
with E[Xi] = 0 and E[∥Xi∥2^p] < ∞, where 1 ≤ p < ∞. Prove that

    E[∥∑_{i=1}^n Xi∥2^p] ≤ Cp E[(∑_{i=1}^n ∥Xi∥2²)^{p/2}],

where Cp is a constant depending only on p.


Exercise 4.7 (Small balls and anti-concentration): Let X be a nonnegative random variable
satisfying P(X ≤ ϵ) ≤ cϵ for some c < ∞ and all ϵ > 0. Argue that if Xi are i.i.d. copies of X, then

    P((1/n) ∑_{i=1}^n Xi ≥ t) ≥ 1 − exp(−2n [1/2 − 2ct]_+²)

for all t.
Exercise 4.8 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Exercise 4.9 (Sub-Gaussian maxima): Let X1, . . . , Xn be σ²-sub-Gaussian (not necessarily independent)
random variables. Show that

(a) E[maxi Xi] ≤ √(2σ² log n).

(b) There exists a numerical constant C < ∞ such that E[maxi |Xi|^p] ≤ (Cpσ² log n)^{p/2}.

Exercise 4.10: Let Z ∼ N(0, 1).


(a) Use the Cauchy-Schwarz inequality to show that
2λ2
 
E exp(λ((µ + Z) − µ )) ≤ exp 4λ2 µ2 + λ +
2 2
 
.
[1 − 4|λ|]+

(b) Use a direct integration argument, as in Examples 4.1.12 and 4.1.13, to show that

    E[exp(λ(µ + Z)²)] ≤ exp(λµ² + 2λ²µ²/(1 − 2λ) − (1/2) log(1 − 2λ))

for λ < 1/2. Use this to prove the first part of Corollary 4.1.19.


Hint. It may be useful to use that −log(1 − x) ≤ x + x²/(2[1 − |x|]_+) for all x ∈ R.

Exercise 4.11: Let Z ∼ N(0, Id ), and let A ∈ Rn×d and b ∈ Rn be otherwise arbitrary. Using
the first part of Corollary 4.1.19, show the second part of Corollary 4.1.19, that is, that ∥AZ − b∥22
is (4(∥A∥2Fr + 2 ∥b∥22 ), 4 ∥A∥2op )-sub-exponential. Hint. Use the singular value decomposition A =
UΓV⊤ of A, and note that V⊤Z ∼ N(0, Id).
Exercise 4.12 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will
derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [112, Thm. 1] or [125, 24]),
showing

    log E[e^{t(X−p)}] ≤ ((1 − 2p)/(4 log((1−p)/p))) t²  for all t ≥ 0.    (4.6.1)

(a) Define φ(t) = log(E[et(X−p) ]) = log((1 − p)e−tp + pet(1−p) ). Show that

φ′ (t) = E[Yt ] and φ′′ (t) = Var(Yt )

where Yt = (1 − p) with probability q(t) := pe^{t(1−p)} / (pe^{t(1−p)} + (1 − p)e^{−tp}) and Yt = −p otherwise.

(b) Show that φ′(0) = 0 and that if p > 1/2, then Var(Yt) ≤ Var(Y0) = p(1 − p) for t ≥ 0. Conclude that
φ(t) ≤ (p(1 − p)/2) t² for all t ≥ 0.
(c) Argue that p(1 − p) ≤ (1 − 2p)/(2 log((1−p)/p)) for p ∈ [0, 1]. Hint: Let p = (1 + δ)/2 for δ ∈ [0, 1], so that the
inequality is equivalent to log((1+δ)/(1−δ)) ≤ 2δ/(1 − δ²). Then use that log(1 + δ) = ∫_0^δ 1/(1+u) du.

(d) Let C = 2 log((1−p)/p) and define s via t = Cs = 2 log((1−p)/p) · s, and let

    f(s) = ((1 − 2p)/2) Cs² + Cps − log(1 − p + pe^{Cs}),
so that inequality (4.6.1) holds if and only if f (s) ≥ 0 for all s ≥ 0. Give f ′ (s) and f ′′ (s).

(e) Show that f (0) = f (1) = f ′ (0) = f ′ (1) = 0, and argue that f ′′ (s) changes signs at most twice
and that f ′′ (0) = f ′′ (1) > 0. Use this to show that f (s) ≥ 0 for all s ≥ 0.

JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
Exercise 4.13: Let s(p) = (1 − 2p)/log((1−p)/p). Show that s is concave on [0, 1].

Exercise 4.14 (Inner products on complex matrices): Recall that ⟨·, ·⟩ is a complex inner product
on a vector space V if it satisfies the following for all x, y, z ∈ V :

(i) ⟨x, x⟩ ≥ 0, with ⟨x, x⟩ = 0 if and only if x = 0.

(ii) It is conjugate symmetric, so that ⟨x, y⟩ is the complex conjugate of ⟨y, x⟩.

(iii) It is conjugate linear in its first argument, so that ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for all α ∈ C.


The vector space V is real with real inner product if property (ii) is replaced with the symmetry
⟨x, y⟩ = ⟨y, x⟩ and linearity (iii) holds for α ∈ R.

(a) Show that the space of complex m × n matrices Cm×n has complex inner product ⟨A, B⟩ :=
tr(A∗ B).

(b) Show that the space Hn of n × n Hermitian matrices is a real vector space with inner product
⟨A, B⟩ := tr(A∗ B), and that consequently ⟨A, B⟩ ∈ R.

Exercise 4.15 (The Lie product formula): Let A and B be symmetric (or Hermitian) matrices.

(a) Prove the Lie product formula (4.4.1), that is,

    lim_{n→∞} (exp(A/n) exp(B/n))^n = exp(A + B).

Hint. One argument proceeds as follows. Let O(ϵ) denote a matrix E such that ∥E∥op ≲ ϵ.
First, demonstrate that
    e^{A/n} = I + (1/n)A + O(n^{−2}).
Then show that for any matrix A, we have (I + n−1 A + o(n−1 ))n → exp(A). Combine these.

(b) Give an example of matrices A and B that do not commute and for which exp(A + B) ̸=
exp(A) exp(B).

Exercise 4.16: Define the trace exponential function f (X) := tr(eX ) on the Hermitian matrices.

(a) Prove that f is monotone for the semidefinite order, that is, if X ⪯ Y , then f (X) ≤ f (Y ).
Hint. It is enough to show that for any A ⪰ 0, the one-dimensional function h(t) := f (X + tA)
is monotone in t, or even that h′ (0) ≥ 0.

(b) Prove that the trace exponential is convex on the Hermitian matrices, that is, f (X) := tr(eX )
is convex. Hint. It is enough to show that for any X, V Hermitian that h(t) := f (X + tV ) is
convex in t, for which it in turn suffices to show that h′′ (0) ≥ 0.

Exercise 4.17 (The matrix-Bernstein inequality): In this question, we prove Theorem 4.3.6.

(a) Let Xi ∈ Hd be independent Hermitian matrices and Sn = ∑_{i=1}^n Xi. Use the Golden-Thompson
inequality (Proposition 4.3.2) to show that for all λ ∈ R,

    tr(E[e^{λSn}]) ≤ d ∏_{i=1}^n ∥E[e^{λXi}]∥op.

(b) Extend Example 4.1.14 to the matrix-valued case. Demonstrate that if X is a mean-zero
Hermitian random matrix with ∥X∥op ≤ b, then for all |λ| < 3/b,

    E[exp(λX)] ⪯ I + (1/(1 − b|λ|/3)) · (λ² E[X²]) / 2.


(c) Use parts (a) and (b) to show that

    tr(E[e^{λSn}]) ≤ d exp(λ² ∑_{i=1}^n σi²)

for |λ| ≤ 3/(2b).

(d) Prove Theorem 4.3.6.

Exercise 4.18: In this question, we use Lieb's concavity inequality (4.5.1) to obtain a stronger
matrix Hoeffding inequality. Let X1, . . . , Xn ∈ Hd be an independent sequence of d × d mean-zero
Hermitian matrices, and let Sn = ∑_{i=1}^n Xi be their sum.

(a) Let X be a random Hermitian matrix with X² ⪯ A², and let ε be an independent uniform random sign. Show that

    log E[e^{λεX} | X] ⪯ (λ²/2) A².
Hint. Use that the matrix logarithm is operator monotone, that is, if A ⪯ B, then log A ⪯ log B.

(b) Assume that Xn² ⪯ An². Show that

    E[tr(exp(λSn))] ≤ E[tr(exp(λS_{n−1} + 2λ²An²))].

(c) Show that if Xi² ⪯ Ai² for each i, then

    E[tr(exp(λSn))] ≤ tr(exp(2λ² ∑_{i=1}^n Ai²)).

(d) Show that if σ² ≥ ∥∑_{i=1}^n Ai²∥op, then for t ≥ 0,

    P(λmax(Sn) ≥ t) ≤ d exp(−t²/(8σ²)).

(e) Give an example of random Hermitian matrices where the preceding bound is much sharper
than Corollary 4.3.5.

Exercise 4.19: In this question, we use Lieb’s concavity inequality (4.5.1) to demonstrate a
sharper matrix Bernstein inequality than Theorem 4.3.6.
(a) Define the matrix cumulant generating function ϕXi (λ) := log E[exp(λXi )]. Show that
    E[tr(exp(λSn))] ≤ tr(exp(∑_{i=1}^n ϕXi(λ))).

(b) Using Exercise 4.17 part (b), show that if E[Xi] = 0, E[Xi²] ⪯ Σi, and ∥Xi∥op ≤ b for each i,
then for |λ| < 3/b, we have

    E[tr(e^{λSn})] ≤ d exp( (1/(1 − b|λ|/3)) · (λ²/2) σ² )  where σ² := ∥∑_{i=1}^n Σi∥op.


(c) Show that there exists a numerical constant c > 0 such that for all t ≥ 0,

    P(∥Sn∥op ≥ t) ≤ 2d exp(−c min{t²/σ², t/b}).

Why is this sharper than Theorem 4.3.6?

Chapter 5

Estimation and generalization

5.1 Uniformity and metric entropy


Now that we have explored a variety of concentration inequalities, we show how to put them to use
in demonstrating that a variety of estimation, learning, and other types of procedures have nice
convergence properties. We first give a somewhat general collection of results, then delve deeper
by focusing on some standard tasks from machine learning.

5.1.1 Symmetrization and uniform laws


The first set of results we consider are uniform laws of large numbers, where the goal is to bound
means uniformly over different classes of functions. Frequently, such results are called Glivenko-
Cantelli laws, after the original Glivenko-Cantelli theorem, which shows that empirical distributions
uniformly converge. We revisit these ideas in the next chapter, where we present a number of more
advanced techniques based on ideas of metric entropy (or volume-like considerations); here we
present the basic ideas using our stability and bounded differencing tools.
The starting point is to define what we mean by a uniform law of large numbers. To do so, we
adopt notation (as in Example 4.2.8) we will use throughout the remainder of the book, reminding
readers as we go. For a sample X1 , . . . , Xn on a space X , we let
    Pn := (1/n) ∑_{i=1}^n 1_{Xi}

denote the empirical distribution on {Xi }ni=1 , where 1Xi denotes the point mass at Xi . Then for
functions f : X → R (or more generally, any function f defined on X ), we let
    Pn f := E_{Pn}[f(X)] = (1/n) ∑_{i=1}^n f(Xi)

denote the empirical expectation of f evaluated on the sample, and we also let
    P f := E_P[f(X)] = ∫ f(x) dP(x)

denote general expectations under a measure P . With this notation, we study uniform laws of
large numbers, which consist of proving results of the form
    sup_{f∈F} |Pn f − P f| → 0,    (5.1.1)


where convergence is in probability, expectation, almost surely, or with rates of convergence. When
we view Pn and P as (infinite-dimensional) vectors on the space of maps from F → R, then we
may define the (semi)norm ∥·∥F for any L : F → R by
    ∥L∥F := sup_{f∈F} |L(f)|,

in which case Eq. (5.1.1) is equivalent to proving


∥Pn − P ∥F → 0.
Thus, roughly, we are simply asking questions about when random vectors converge to their expec-
tations.1
The starting point of this investigation considers bounded random functions, that is, F consists
of functions f : X → [a, b] for some −∞ < a ≤ b < ∞. In this case, the bounded differences
inequality (Proposition 4.2.5) immediately implies that expectations of ∥Pn − P ∥F provide strong
guarantees on concentration of ∥Pn − P ∥F .
Proposition 5.1.1. Let F be as above. Then
    P(∥Pn − P∥F ≥ E[∥Pn − P∥F] + t) ≤ exp(−2nt²/(b − a)²)  for t ≥ 0.
Proof Let Pn and Pn′ be two empirical distributions, differing only in observation i (with Xi and
Xi′ ). We observe that
    sup_{f∈F} |Pn f − P f| − sup_{f∈F} |P′n f − P f| ≤ sup_{f∈F} { |Pn f − P f| − |P′n f − P f| }
        ≤ (1/n) sup_{f∈F} |f(Xi) − f(X′i)| ≤ (b − a)/n

by the triangle inequality. An entirely parallel argument gives the converse lower bound of −(b − a)/n,
and thus Proposition 4.2.5 gives the result.

Proposition 5.1.1 shows that, to provide control over high-probability concentration of ∥Pn − P ∥F ,
it is (at least in cases where F is bounded) sufficient to control the expectation E[∥Pn − P ∥F ]. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized parts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetrized quantity and the initial quantity.
Proposition 5.1.2. Let Xi be independent random vectors on a (Banach) space with norm ∥·∥
and let εi ∈ {±1} be independent random signs. Then for any p ≥ 1,

    2^{−p} E[∥∑_{i=1}^n εi(Xi − E[Xi])∥^p] ≤ E[∥∑_{i=1}^n (Xi − E[Xi])∥^p] ≤ 2^p E[∥∑_{i=1}^n εi Xi∥^p].
1
Some readers may worry about measurability issues here. All of our applications will be in separable spaces,
so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this
from now on.


In the proof of the upper bound, we could also show the bound

    E[∥∑_{i=1}^n (Xi − E[Xi])∥^p] ≤ 2^p E[∥∑_{i=1}^n εi(Xi − E[Xi])∥^p],

so we may analyze whichever is more convenient.


Proof  We prove the right bound first. We introduce independent copies of the Xi and use
these to symmetrize the quantity. Indeed, let X′i be an independent copy of Xi, and use Jensen's
inequality and the convexity of ∥·∥^p to observe that

    E[∥∑_{i=1}^n (Xi − E[Xi])∥^p] = E[∥∑_{i=1}^n (Xi − E[X′i])∥^p] ≤ E[∥∑_{i=1}^n (Xi − X′i)∥^p].

Now, note that the distribution of Xi − X′i is symmetric, so that Xi − X′i and εi(Xi − X′i) have
the same distribution, and thus

    E[∥∑_{i=1}^n (Xi − E[Xi])∥^p] ≤ E[∥∑_{i=1}^n εi(Xi − X′i)∥^p].

Multiplying and dividing by 2^p, Jensen's inequality then gives

    E[∥∑_{i=1}^n (Xi − E[Xi])∥^p] ≤ 2^p E[∥(1/2) ∑_{i=1}^n εi(Xi − X′i)∥^p]
        ≤ 2^{p−1} ( E[∥∑_{i=1}^n εi Xi∥^p] + E[∥∑_{i=1}^n εi X′i∥^p] )

as desired.
as desired.
For the left bound in the proposition, let Yi = Xi − E[Xi] be the centered version of the random
variables. We break the sum over random variables into two parts, conditional on whether εi = ±1,
using repeated conditioning. We have

    E[∥∑_{i=1}^n εi Yi∥^p] = E[∥∑_{i: εi=1} Yi − ∑_{i: εi=−1} Yi∥^p]
    ≤ E[ 2^{p−1} E[∥∑_{i: εi=1} Yi∥^p | ε] + 2^{p−1} E[∥∑_{i: εi=−1} Yi∥^p | ε] ]
    = 2^{p−1} E[ E[∥∑_{i: εi=1} Yi + ∑_{i: εi=−1} E[Yi]∥^p | ε] + E[∥∑_{i: εi=−1} Yi + ∑_{i: εi=1} E[Yi]∥^p | ε] ]
    ≤ 2^{p−1} E[ E[∥∑_{i: εi=1} Yi + ∑_{i: εi=−1} Yi∥^p | ε] + E[∥∑_{i: εi=−1} Yi + ∑_{i: εi=1} Yi∥^p | ε] ]
    = 2^p E[∥∑_{i=1}^n Yi∥^p].


We obtain as an immediate corollary a symmetrization bound for supremum norms on function
spaces. In this corollary, we use the symmetrized empirical measure

    P⁰n := (1/n) ∑_{i=1}^n εi 1_{Xi},   P⁰n f = (1/n) ∑_{i=1}^n εi f(Xi).

The expectation of ∥P⁰n∥F is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and
we have the following corollary.

Corollary 5.1.3. Let F be a class of functions f : X → R and Xi be i.i.d. Then E[∥Pn − P∥F] ≤
2E[∥P⁰n∥F].
From Corollary 5.1.3, it is evident that by controlling the expectation of the symmetrized process
E[∥P⁰n∥F] we can derive concentration inequalities and uniform laws of large numbers. For example,
we immediately obtain that

    P(∥Pn − P∥F ≥ 2E[∥P⁰n∥F] + t) ≤ exp(−2nt²/(b − a)²)

for all t ≥ 0 whenever F consists of functions f : X → [a, b].
There are numerous examples of uniform laws of large numbers, many of which reduce to
developing bounds on the expectation E[∥Pn0 ∥F ], which is frequently possible via more advanced
techniques we develop in Chapter 7. A frequent application of these symmetrization ideas is to
risk minimization problems, as we discuss in the coming section; for these, it will be useful for us
to develop a few analytic and calculus tools. To better match the development of these ideas, we
return to the notation of Rademacher complexities, so that Rn(F) := E[∥P⁰n∥F]. The first is a
standard result, which we state for its historical value and the simplicity of its proof.
Proposition 5.1.4 (Massart's finite class bound). Let F be any collection of functions with f :
X → R, and assume that σn² := n^{−1} E[max_{f∈F} ∑_{i=1}^n f(Xi)²] < ∞. Then

    Rn(F) ≤ √(2σn² log|F|) / √n.

Proof  For each fixed x1^n, the random variable ∑_{i=1}^n εi f(xi) is ∑_{i=1}^n f(xi)²-sub-Gaussian. Now,
define σ²(x1^n) := n^{−1} max_{f∈F} ∑_{i=1}^n f(xi)². Using the results of Exercise 4.9, that is, that E[max_{j≤N} Zj] ≤
√(2σ² log N) if the Zj are each σ²-sub-Gaussian, we see that

    Rn(F | x1^n) ≤ √(2σ²(x1^n) log|F|) / √n.

Jensen's inequality, that E[√·] ≤ √(E[·]), gives the result.
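A quick numerical comparison (our sketch, not from the original notes; it assumes numpy) estimates the empirical Rademacher complexity of a random finite class by Monte Carlo and compares it with Massart's bound:

```python
# Monte Carlo check of Massart's finite class bound.
# F is a finite class of functions encoded by value vectors (f(x_1), ..., f(x_n)).
import numpy as np

rng = np.random.default_rng(3)
n, num_funcs, trials = 100, 50, 5000

# Random bounded function values f(x_i) in [-1, 1] for each f in F.
F_vals = rng.uniform(-1, 1, size=(num_funcs, n))

# Empirical Rademacher complexity R_n(F | x_1^n) = E[max_f (1/n) sum_i eps_i f(x_i)].
eps = rng.choice([-1.0, 1.0], size=(trials, n))
rademacher = np.mean(np.max(eps @ F_vals.T / n, axis=1))

sigma2 = np.max(np.sum(F_vals**2, axis=1)) / n     # sigma^2(x_1^n) for these points
massart = np.sqrt(2 * sigma2 * np.log(num_funcs) / n)
print(f"Monte Carlo R_n(F): {rademacher:.4f}, Massart bound: {massart:.4f}")
```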

A refinement of Massart’s finite class bound applies when the classes are infinite but, on a
collection X1 , . . . , Xn , the functions f ∈ F may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points x1 , . . . , xn by SF (xn1 ) :=
card{(f (x1 ), . . . , f (xn )) | f ∈ F }, the number of distinct vectors of values (f (x1 ), . . . , f (xn )) the
functions f ∈ F may take. The shatter coefficient is the maximum of the empirical shatter coeffi-
cients over xn1 ∈ X n , that is, SF (n) := supxn1 SF (xn1 ). It is clear that SF (n) ≤ |F| always, but by
only counting distinct values, we have the following corollary.


Corollary 5.1.5 (A sharper variant of Massart's finite class bound). Let F be any collection of
functions with f : X → R, and assume that σn² := n^{−1} E[max_{f∈F} ∑_{i=1}^n f(Xi)²] < ∞. Then

    Rn(F) ≤ √(2σn² log SF(n)) / √n.

Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books on machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher
sums, which allow us to consider compositions of function classes and maintain small complexity
measures. We state the rule here; the proof is complex, so we defer it to Section 4.4.3.

Theorem 5.1.6 (Ledoux-Talagrand Contraction). Let T ⊂ Rn be an arbitrary set and let ϕi : R →
R be 1-Lipschitz and satisfy ϕi(0) = 0. Then for any nondecreasing convex function Φ : R → R+,

    E[Φ((1/2) sup_{t∈T} ∑_{i=1}^n ϕi(ti)εi)] ≤ E[Φ(sup_{t∈T} ⟨t, ε⟩)].

A corollary to this theorem is suggestive of its power and applicability. Let ϕ : R → R be
L-Lipschitz, and for a function class F define ϕ ◦ F = {ϕ ◦ f | f ∈ F}. Then we have the following
corollary about Rademacher complexities of contractive mappings.

Corollary 5.1.7. Let F be an arbitrary function class and ϕ be L-Lipschitz. Then

    Rn(ϕ ◦ F) ≤ 2LRn(F) + |ϕ(0)|/√n.

Proof  The result is an almost immediate consequence of Theorem 5.1.6; we simply recenter our
functions. Indeed, we have

    Rn(ϕ ◦ F | x1^n) = E[sup_{f∈F} { (1/n) ∑_{i=1}^n εi(ϕ(f(xi)) − ϕ(0)) } + (1/n) ∑_{i=1}^n εi ϕ(0)]
        ≤ E[sup_{f∈F} (1/n) ∑_{i=1}^n εi(ϕ(f(xi)) − ϕ(0))] + E[|(1/n) ∑_{i=1}^n εi ϕ(0)|]
        ≤ 2LRn(F) + |ϕ(0)|/√n,

where the final inequality follows by Theorem 5.1.6 (as g(·) = ϕ(·) − ϕ(0) is Lipschitz and satisfies
g(0) = 0) and that E[|∑_{i=1}^n εi|] ≤ √n.

5.1.2 Metric entropy, coverings, and packings


When the class of functions F under consideration is finite, the union bound more or less provides
guarantees that Pn f is uniformly close to P f for all f ∈ F. When F is infinite, however, we require
a different set of tools for addressing uniform laws. In many cases, because of the application
of the bounded differences inequality in Proposition 5.1.1, all we really need to do is to control


the expectation E[∥Pn0 ∥F ], though the techniques we develop here will have broader use and can
sometimes directly guarantee concentration.
The basic object we wish to control is a measure of the size of the space on which we work.
To that end, we modify notation a bit to simply consider arbitrary vectors θ ∈ Θ, where Θ is a
non-empty set with an associated (semi)metric ρ. For many purposes in estimation (and in our
optimality results in the further parts of the book), a natural way to measure the size of the set is
via the number of balls of a fixed radius δ > 0 required to cover it.

Definition 5.1 (Covering number). Let Θ be a set with (semi)metric ρ. A δ-cover of the set Θ with
respect to ρ is a set {θ1 , . . . , θN } such that for any point θ ∈ Θ, there exists some v ∈ {1, . . . , N }
such that ρ(θ, θv ) ≤ δ. The δ-covering number of Θ is

N (δ, Θ, ρ) := inf {N ∈ N : there exists a δ-cover θ1 , . . . , θN of Θ} .

The metric entropy of the set Θ is simply the logarithm of its covering number log N (δ, Θ, ρ).
We can define a related measure—more useful for constructing our lower bounds—of size that
relates to the number of disjoint balls of radius δ > 0 that can be placed into the set Θ.

Definition 5.2 (Packing number). A δ-packing of the set Θ with respect to ρ is a set {θ1 , . . . , θM }
such that for all distinct v, v ′ ∈ {1, . . . , M }, we have ρ(θv , θv′ ) ≥ δ. The δ-packing number of Θ is

M (δ, Θ, ρ) := sup {M ∈ N : there exists a δ-packing θ1 , . . . , θM of Θ} .

Figures 5.1 and 5.2 give examples of (respectively) a covering and a packing of the same set.

[Figure 5.1: A δ-covering of the elliptical set by balls of radius δ.]

An exercise in proof by contradiction shows that the packing and covering numbers of a set are
in fact closely related:

Lemma 5.1.8. The packing and covering numbers satisfy the following inequalities:

M (2δ, Θ, ρ) ≤ N (δ, Θ, ρ) ≤ M (δ, Θ, ρ).

We leave derivation of this lemma to Exercise 5.2, noting that it shows that (up to constant factors)
packing and covering numbers have the same scaling in the radius δ. As a simple example, we see


[Figure 5.2: A δ-packing of the elliptical set, where balls have radius δ/2. No balls overlap, and each center of the packing satisfies ∥θv − θv′∥ ≥ δ.]

for any interval [a, b] on the real line, in the usual absolute distance metric, that N(δ, [a, b], |·|) ≍
(b − a)/δ.
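For the interval example, covering and packing numbers can be computed in closed form; the sketch below (ours, assuming numpy; the helper names are our own) checks the inequalities of Lemma 5.1.8 and the (b − a)/δ scaling:

```python
# A concrete check of covering/packing numbers for an interval.
import numpy as np

def covering_number_interval(a, b, delta):
    """Minimal number of radius-delta balls (intervals) covering [a, b]."""
    return int(np.ceil((b - a) / (2 * delta)))

def packing_number_interval(a, b, delta):
    """Maximal number of points in [a, b] with pairwise distance >= delta."""
    return int(np.floor((b - a) / delta)) + 1

a, b, delta = 0.0, 10.0, 0.4
N = covering_number_interval(a, b, delta)
M2 = packing_number_interval(a, b, 2 * delta)
M = packing_number_interval(a, b, delta)
assert M2 <= N <= M          # Lemma 5.1.8: M(2 delta) <= N(delta) <= M(delta)
print(N, M2, M)              # N scales as (b - a)/delta, as claimed
```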
As one example of the metric entropy, consider a set of functions F with reasonable covering
numbers (metric entropy) in ∥·∥∞ -norm.

Example 5.1.9 (The “standard” covering number guarantee): Let F consist of functions
f : X → [−b, b] and let the metric ρ be ∥f − g∥∞ = sup_{x∈X} |f(x) − g(x)|. Then

    P(sup_{f∈F} |Pn f − P f| ≥ t) ≤ exp(−nt²/(18b²) + log N(t/3, F, ∥·∥∞)).    (5.1.2)

So as long as the covering numbers N(t, F, ∥·∥∞) grow sub-exponentially in t—so that log N(t) ≪
nt²—we have the (essentially) sub-Gaussian tail bound (5.1.2). Example 5.2.11 gives one typical
case. Indeed, fix a minimal t/3-cover of F in ∥·∥∞ of size N := N(t/3, F, ∥·∥∞), calling
the covering functions f1, . . . , fN. Then for any f ∈ F and the function fi satisfying
∥f − fi∥∞ ≤ t/3, we have

    |Pn f − P f| ≤ |Pn f − Pn fi| + |Pn fi − P fi| + |P fi − P f| ≤ |Pn fi − P fi| + 2t/3.

The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that

    P(max_{i≤N} |Pn fi − P fi| ≥ t) ≤ exp(−nt²/(2b²) + log N).

Combine this bound (replacing t with t/3) to obtain inequality (5.1.2). 3

Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.

Lemma 5.1.10. Let B denote the unit ∥·∥-ball in Rd . Then


    (1/δ)^d ≤ N(δ, B, ∥·∥) ≤ (1 + 2/δ)^d.


Proof  We prove the lemma via a volumetric argument. For the lower bound, note that if the
points v1, . . . , vN are a δ-cover of B, then

    Vol(B) ≤ ∑_{i=1}^N Vol(δB + vi) = N Vol(δB) = N Vol(B) δ^d.

In particular, N ≥ δ^{−d}. For the upper bound on N(δ, B, ∥·∥), let V be a δ-packing of B with
maximal cardinality, so that |V| = M(δ, B, ∥·∥) ≥ N(δ, B, ∥·∥) (recall Lemma 5.1.8). Notably, the
collection of δ-balls {δB + vi}_{i=1}^M covers the ball B (as otherwise, we could put an additional element
in the packing V), and moreover, the balls {(δ/2)B + vi} are all disjoint by definition of a packing.
Consequently, we find that

    M (δ/2)^d Vol(B) = M Vol((δ/2)B) ≤ Vol(B + (δ/2)B) = (1 + δ/2)^d Vol(B).

Rewriting, we obtain

    M(δ, B, ∥·∥) ≤ (2/δ)^d (1 + δ/2)^d = (1 + 2/δ)^d,

completing the proof.

5.1.3 Application: matrix concentration


Let us give one application of Lemma 5.1.10 to concentration of random matrices; we explore more
in the exercises as well. We can generalize the definition of sub-Gaussian random variables to
sub-Gaussian random vectors, where we say that X ∈ Rd is a σ²-sub-Gaussian vector if

    E[exp(⟨u, X − E[X]⟩)] ≤ exp(σ² ∥u∥2² / 2)    (5.1.3)

for all u ∈ Rd. For example, X ∼ N(0, Id) is immediately 1-sub-Gaussian, and X ∈ [−b, b]^d with
independent entries is b²-sub-Gaussian. Now, suppose that Xi are independent isotropic random
vectors, meaning that E[Xi] = 0 and E[XiXi⊤] = Id, and that they are also σ²-sub-Gaussian. Then by
an application of Lemma 5.1.10, we can give concentration guarantees for the sample covariance
Σn := (1/n) ∑_{i=1}^n XiXi⊤ in the operator norm ∥A∥op := sup{⟨u, Av⟩ | ∥u∥2 = ∥v∥2 = 1}.

Proposition 5.1.11. Let Xi be independent isotropic and σ²-sub-Gaussian vectors. Then there is
a numerical constant C such that the sample covariance Σn := (1/n) ∑_{i=1}^n XiXi⊤ satisfies

    ∥Σn − Id∥op ≤ Cσ² ( √((d + log(1/δ))/n) + (d + log(1/δ))/n )

with probability at least 1 − δ.

Proof We begin with an intermediate lemma.


Lemma 5.1.12. Let A be symmetric and {ui}_{i=1}^N be an ϵ-cover of the unit ℓ2-ball B2^d. Then

    (1 − 2ϵ) ∥A∥op ≤ max_{i≤N} ⟨ui, Aui⟩ ≤ ∥A∥op.

Proof  The second inequality is trivial. Fix any u ∈ B2^d. Then for the i such that ∥u − ui∥2 ≤ ϵ,
we have

    ⟨u, Au⟩ = ⟨u − ui, Au⟩ + ⟨ui, A(u − ui)⟩ + ⟨ui, Aui⟩ ≤ 2ϵ ∥A∥op + ⟨ui, Aui⟩

by definition of the operator norm. Taking a supremum over u gives the final result.

Let the matrix Ei = XiXi⊤ − I, and define the average error Ēn = (1/n) ∑_{i=1}^n Ei. Then with this lemma
in hand, we see that for any ϵ-cover N of the ℓ2-ball B2^d,

    (1 − 2ϵ) ∥Ēn∥op ≤ max_{u∈N} ⟨u, Ēn u⟩.

Now, note that ⟨u, Eiu⟩ = ⟨u, Xi⟩² − ∥u∥2² is sub-exponential, as it is certainly mean 0 and, moreover,
is the square of a sub-Gaussian; in particular, Theorem 4.1.15 shows that there is a numerical
constant C < ∞ such that

    E[exp(λ⟨u, Eiu⟩)] ≤ exp(Cλ²σ⁴)  for |λ| ≤ 1/(Cσ²).

Taking ϵ = 1/4 in our covering N, then,

    P(∥Ēn∥op ≥ t) ≤ P(max_{u∈N} ⟨u, Ēnu⟩ ≥ t/2) ≤ |N| · max_{u∈N} P(⟨u, nĒnu⟩ ≥ nt/2)

by a union bound. As sums of sub-exponential random variables remain sub-exponential, Corollary
4.1.18 implies

    P(∥Ēn∥op ≥ t) ≤ |N| exp(−c min{nt²/σ⁴, nt/σ²}),

where c > 0 is a numerical constant. Finally, we apply Lemma 5.1.10, which guarantees that
|N| ≤ 9^d, and then take t to scale as the maximum of σ²√((d + log(1/δ))/n) and σ²(d + log(1/δ))/n.
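A short simulation (ours, assuming numpy) illustrates Proposition 5.1.11 with standard Gaussian data, which are isotropic and 1-sub-Gaussian; the operator-norm error indeed tracks √(d/n):

```python
# Simulation of sample covariance concentration (Proposition 5.1.11).
import numpy as np

rng = np.random.default_rng(4)
d = 20
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, d))          # rows are isotropic 1-sub-Gaussian vectors
    Sigma_n = X.T @ X / n                    # sample covariance
    err = np.linalg.norm(Sigma_n - np.eye(d), 2)
    print(f"n = {n:6d}: ||Sigma_n - I||_op = {err:.4f}, sqrt(d/n) = {np.sqrt(d/n):.4f}")
```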

5.2 Generalization bounds


We now build off of our ideas on uniform laws of large numbers and Rademacher complexities to
demonstrate their applications in statistical machine learning problems, focusing on empirical risk
minimization procedures and related problems. We consider a setting as follows: we have a sample
Z1 , . . . , Zn ∈ Z drawn i.i.d. according to some (unknown) distribution P , and we have a collection
of functions F from which we wish to select an f that “fits” the data well, according to some loss
measure ℓ : F × Z → R. That is, we wish to find a function f ∈ F minimizing the risk

L(f ) := EP [ℓ(f, Z)]. (5.2.1)


In general, however, we only have access to the risk via the empirical distribution of the Zi , and
we often choose f by minimizing the empirical risk
    L̂n(f) := (1/n) ∑_{i=1}^n ℓ(f, Zi).    (5.2.2)

As written, this formulation is quite abstract, so we provide a few examples to make it somewhat
more concrete.
Example 5.2.1 (Binary classification problems): One standard problem—still abstract—
that motivates the formulation (5.2.1) is the binary classification problem. Here the data Zi
come in pairs (X, Y ), where X ∈ X is some set of covariates (independent variables) and
Y ∈ {−1, 1} is the label of example X. The function class F consists of functions f : X → R,
and the goal is to find a function f such that
P(sign(f (X)) ̸= Y )
is small, that is, minimizing the risk E[ℓ(f, Z)] where the loss is the 0-1 loss, ℓ(f, (x, y)) =
1 {f (x)y ≤ 0}. 3

Example 5.2.2 (Multiclass classification): The multiclass classification problem is identical


to the binary problem, but instead of Y ∈ {−1, 1} we assume that Y ∈ [k] = {1, . . . , k} for
some k ≥ 2, and the function class F consists of (a subset of) functions f : X → Rk . The
goal is to find a function f such that, if Y = y is the correct label for a datapoint x, then
fy (x) > fl (x) for all l ̸= y. That is, we wish to find f ∈ F minimizing
P (∃ l ̸= Y such that fl (X) ≥ fY (X)) .
In this case, the loss function is the zero-one loss ℓ(f, (x, y)) = 1 {maxl̸=y fl (x) ≥ fy (x)}. 3

Example 5.2.3 (Binary classification with linear functions): In the standard statistical
learning setting, the data x belong to Rd , and we assume that our function class F is indexed
by a set Θ ⊂ Rd , so that F = {fθ : fθ (x) = θ⊤ x, θ ∈ Θ}. In this case, we may use the zero-one
loss, the convex hinge loss, or the (convex) logistic loss, which are variously ℓzo(fθ, (x, y)) :=
1{yθ⊤x ≤ 0}, and the convex losses

    ℓhinge(fθ, (x, y)) = [1 − yx⊤θ]_+  and  ℓlogit(fθ, (x, y)) = log(1 + exp(−yx⊤θ)).

The hinge and logistic losses, as they are convex, are substantially computationally easier to
work with, and they are common choices in applications. 3

The main motivating question that we ask is the following: given a sample Z1, . . . , Zn, if we
choose some f̂n ∈ F based on this sample, can we guarantee that it generalizes to unseen data? In
particular, can we guarantee that (with high probability) we have the generalization bound

    L(f̂n) ≤ L̂n(f̂n) + ϵ = (1/n) ∑_{i=1}^n ℓ(f̂n, Zi) + ϵ    (5.2.3)

for some small ϵ? If we allow f̂n to be arbitrary, then this becomes clearly impossible: consider the
classification example 5.2.1, and set f̂n to be the “hash” function that sets f̂n(x) = y if the pair
(x, y) was in the sample, and otherwise f̂n(x) = −1. Then clearly L̂n(f̂n) = 0, while there is no
useful bound on L(f̂n).


5.2.1 Finite and countable classes of functions


In order to get bounds of the form (5.2.3), we require a few assumptions that are not too onerous.
First, throughout this section, we will assume that for any fixed function f, the loss ℓ(f, Z) is
σ²-sub-Gaussian, that is,

    EP[exp(λ(ℓ(f, Z) − L(f)))] ≤ exp(λ²σ²/2)    (5.2.4)

for all f ∈ F. (Recall that the risk functional L(f) = EP[ℓ(f, Z)].) For example, if the loss is the
zero-one loss from classification problems, inequality (5.2.4) is satisfied with σ² = 1/4 by Hoeffding's
lemma. In order to guarantee a bound of the form (5.2.4) for a function f̂ chosen dependent on
the data, in this section we give uniform bounds, that is, we would like to bound

    P(there exists f ∈ F s.t. L(f) > L̂n(f) + t)  or  P(sup_{f∈F} |L̂n(f) − L(f)| > t).

Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy
for the true risk L, even when fbn is chosen based on the data.
Now, recalling that our set of functions or predictors F is finite or countable, let us suppose
that for each f ∈ F, we have a complexity measure c(f )—a penalty—such that
    ∑_{f∈F} e^{−c(f)} ≤ 1.    (5.2.5)

This inequality should remind readers of the Kraft inequality from coding theory. As soon as we have such a penalty function, however, we have the
following result.
Theorem 5.2.4. Let the loss ℓ, distribution P on Z, and function class F be such that ℓ(f, Z) is
σ 2 -sub-Gaussian for each f ∈ F, and assume that the complexity inequality (5.2.5) holds. Then
with probability at least 1 − δ over the sample Z1:n ,
    L(f) ≤ L̂n(f) + √(2σ² (log(1/δ) + c(f)) / n)  for all f ∈ F.
Proof  First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10)
we have for any t ≥ 0 and any f ∈ F that

    P(L(f) ≥ L̂n(f) + t) ≤ exp(−nt²/(2σ²)).

Now, if we replace t by √(t² + 2σ²c(f)/n), we obtain

    P(L(f) ≥ L̂n(f) + √(t² + 2σ²c(f)/n)) ≤ exp(−nt²/(2σ²) − c(f)).

Then using a union bound, we have

    P(∃ f ∈ F s.t. L(f) ≥ L̂n(f) + √(t² + 2σ²c(f)/n)) ≤ ∑_{f∈F} exp(−nt²/(2σ²) − c(f))
        = exp(−nt²/(2σ²)) ∑_{f∈F} exp(−c(f)) ≤ exp(−nt²/(2σ²)),

where the final inequality uses the complexity bound (5.2.5). Setting t² = 2σ² log(1/δ)/n gives the result.

As one classical example of this setting, suppose that we have a finite class of functions F. Then
we can set c(f ) = log |F|, in which case we clearly have the summation guarantee (5.2.5), and we
obtain

    L(f) ≤ L̂n(f) + √(2σ² (log(1/δ) + log|F|) / n)  uniformly for f ∈ F
with probability at least 1 − δ. To make this even more concrete, consider the following example.

Example 5.2.5 (Floating point classifiers): We implement a linear binary classifier using
double-precision floating point values, that is, we have fθ(x) = θ⊤x for all θ ∈ Rd that may
be represented using d double-precision floating point numbers. Then for each coordinate of
θ, there are at most 2^64 representable numbers; in total, we must thus have |F| ≤ 2^{64d}. Thus,
for the zero-one loss ℓzo(fθ, (x, y)) = 1{θ⊤xy ≤ 0}, we have

    L(fθ) ≤ L̂n(fθ) + √((log(1/δ) + 45d) / (2n))

for all representable classifiers simultaneously, with probability at least 1 − δ, as the zero-one
loss is 1/4-sub-Gaussian. (Here we have used that 64 log 2 < 45.) 3
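Concretely, the sample sizes this bound requires are easy to tabulate; the helper below (hypothetical code of ours, with a name of our own choosing) evaluates the deviation term of Example 5.2.5:

```python
# Evaluating the uniform deviation term of Example 5.2.5 (a sketch).
import numpy as np

def float_classifier_excess(n, d, delta):
    """Uniform deviation for zero-one loss over double-precision linear classifiers."""
    return np.sqrt((np.log(1 / delta) + 45 * d) / (2 * n))

# With d = 10 features and delta = 0.05, how does the gap shrink with n?
for n in [10**4, 10**5, 10**6]:
    print(n, float_classifier_excess(n, d=10, delta=0.05))
```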

We also note in passing that by replacing δ with δ/2 in the bounds of Theorem 5.2.4, a union
bound yields the following two-sided corollary.

Corollary 5.2.6. Under the conditions of Theorem 5.2.4, we have

    |L̂n(f) − L(f)| ≤ √(2σ² (log(2/δ) + c(f)) / n)  for all f ∈ F

with probability at least 1 − δ.

5.2.2 Large classes


When the collection of functions is (uncountably) infinite, it can be more challenging to obtain
strong generalization bounds, though there still exist numerous tools for these ideas. The most
basic, of which we will give examples, leverage covering number bounds (essentially, as in Exam-
ple 5.1.9). We return in the next chapter to alternative approaches based on randomization and
divergence measures, which provide guarantees with somewhat similar structure to those we present
here.
Let us begin by considering a few examples, after which we provide examples showing how to
derive explicit bounds using Rademacher complexities.

Example 5.2.7 (Rademacher complexity of the ℓ2-ball): Let Θ = {θ ∈ Rd | ∥θ∥2 ≤ r}, and
consider the class of linear functionals F := {fθ(x) = θᵀx, θ ∈ Θ}. Then

    Rn(F | x1^n) ≤ (r/n) √(∑_{i=1}^n ∥xi∥2²),

because we have

    Rn(F | x1^n) = (r/n) E[∥∑_{i=1}^n εixi∥2] ≤ (r/n) √(E[∥∑_{i=1}^n εixi∥2²]) = (r/n) √(∑_{i=1}^n ∥xi∥2²),

as desired. 3
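The bound in Example 5.2.7 is easy to test by Monte Carlo; the sketch below (ours, assuming numpy) compares the empirical Rademacher complexity of the ℓ2-ball class to (r/n)√(∑i ∥xi∥2²):

```python
# Monte Carlo check of the l2-ball Rademacher complexity bound.
import numpy as np

rng = np.random.default_rng(5)
n, d, r, trials = 200, 10, 1.0, 2000
x = rng.standard_normal((n, d))

# R_n(F | x_1^n) = (r/n) E|| sum_i eps_i x_i ||_2 for linear functionals over the r-ball.
eps = rng.choice([-1.0, 1.0], size=(trials, n))
mc = r / n * np.mean(np.linalg.norm(eps @ x, axis=1))

bound = r / n * np.sqrt(np.sum(x**2))   # (r/n) sqrt(sum_i ||x_i||_2^2)
print(f"Monte Carlo: {mc:.4f}, bound: {bound:.4f}")
```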

In high-dimensional situations, it is sometimes useful to consider more restrictive function


classes, for example, those indexed by vectors in an ℓ1 -ball.

Example 5.2.8 (Rademacher complexity of the ℓ1 -ball): In contrast to the previous example,
suppose that Θ = {θ ∈ Rd | ∥θ∥1 ≤ r}, and consider the linear class F := {fθ (x) = θT x, θ ∈ Θ}.
Then
" n #
r X
Rn (F | xn1 ) = E εi x i .
n ∞ i=1

Now, each coordinate j of ni=1 εi xi is ni=1 x2ij -sub-Gaussian, and thus using that E[maxj≤d Zj ] ≤
P P
p
2σ 2 log d for arbitrary σ 2 -sub-Gaussian Zj (see Exercise 4.9), we have
v
u n
n r u X
Rn (F | x1 ) ≤ t2 log(2d) max x2ij .
n j
i=1

To facilitate comparison with Example 5.2.7, suppose that the vectors xi all satisfy ∥xi∥∞ ≤ b.
In this case, the preceding inequality implies that Rn(F | x1^n) ≤ rb√(2 log(2d))/√n. In contrast,
the ℓ2-norm of such xi may satisfy ∥xi∥2 = b√d, so that the bounds of Example 5.2.7 scale
instead as rb√d/√n, which can be exponentially larger. 3

These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L =
{(x, y) 7→ ℓ(yf (x))}f ∈F be the induced collection of losses. As a typical example, we might have
ℓ(t) = [1 − t]+ , ℓ(t) = e−t , or ℓ(t) = log(1 + e−t ). We have the following proposition.

Proposition 5.2.9. Let F and X be such that sup_{x∈X} |f(x)| ≤ M for f ∈ F and assume that
ℓ is L-Lipschitz. Define the empirical and population risks L̂n(f) := Pn ℓ(Y f(X)) and L(f) :=
P ℓ(Y f(X)). Then

    P(sup_{f∈F} |L̂n(f) − L(f)| ≥ 4LRn(F) + t) ≤ 2 exp(−nt²/(2L²M²))  for t ≥ 0.

Proof  We may recenter the class L, that is, replace ℓ(·) with ℓ(·) − ℓ(0), without changing
L̂n(f) − L(f). Call this class L0, so that ∥Pn − P∥_L = ∥Pn − P∥_{L0}. This recentered class satisfies
bounded differences with constant 2ML, as |ℓ(yf(x)) − ℓ(y′f(x′))| ≤ L|yf(x) − y′f(x′)| ≤ 2LM,
as in the proof of Proposition 5.1.1. Applying Proposition 5.1.1 and then Corollary 5.1.3 gives
that P(sup_{f∈F} |L̂n(f) − L(f)| ≥ 2Rn(L0) + t) ≤ 2 exp(−nt²/(2M²L²)) for t ≥ 0. Then applying the con-
traction inequality (Theorem 5.1.6) yields Rn(L0) ≤ 2LRn(F), giving the result.

Let us give a few example applications of these ideas.

Example 5.2.10 (Support vector machines and hinge losses): In the support vector machine
problem, we receive data (Xi, Yi) ∈ Rd × {±1}, and we seek to minimize the average of the losses
ℓ(θ; (x, y)) = [1 − yθᵀx]_+. We assume that the space X has ∥x∥2 ≤ b for x ∈ X and that
Θ = {θ ∈ Rd | ∥θ∥2 ≤ r}. Applying Proposition 5.2.9 gives

    P(sup_{θ∈Θ} |Pn ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| ≥ 4Rn(FΘ) + t) ≤ 2 exp(−nt²/(2r²b²)),

where FΘ = {fθ(x) = θᵀx}_{θ∈Θ}. Now, we apply Example 5.2.7, which implies that

    Rn(ϕ ◦ FΘ) ≤ 2Rn(FΘ) ≤ 2rb/√n.

That is, we have

    P(sup_{θ∈Θ} |Pn ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| ≥ 4rb/√n + t) ≤ 2 exp(−nt²/(2(rb)²)),

so that Pn and P become close at rate roughly rb/√n in this case. 3

Example 5.2.10 is what is sometimes called a “dimension free” convergence result—there is no
explicit dependence on the dimension d of the problem, except as the radii r and b make explicit.
One consequence of this is that if x and θ instead belong to a Hilbert space (potentially infinite
dimensional) with inner product ⟨·, ·⟩ and norm ∥x∥² = ⟨x, x⟩, but for which we are guaranteed
that ∥θ∥ ≤ r and similarly ∥x∥ ≤ b, then the result still applies. Extending this to other function
classes is reasonably straightforward, and we present a few examples in the exercises.
When we do not have the simplifying structure of ℓ(yf (x)) identified in the preceding examples,
we can still provide guarantees of generalization using the covering number guarantees introduced
in Section 5.1.2. The most common and important case is when we have a Lipschitzian loss function
in an underlying parameter θ.

Example 5.2.11 (Lipschitz functions over a norm-bounded parameter space): Consider the
parametric loss minimization problem

    minimize_{θ∈Θ}  L(θ) := E[ℓ(θ; Z)]

for a loss function ℓ that is M-Lipschitz (with respect to the norm ∥·∥) in its argument, where
for normalization we assume inf_{θ∈Θ} ℓ(θ, z) = 0 for each z. Then the metric entropy of Θ
bounds the metric entropy of the loss class F := {z ↦ ℓ(θ, z)}_{θ∈Θ} for the supremum norm
∥·∥∞. Indeed, for any pair θ, θ′, we have

    sup_z |ℓ(θ, z) − ℓ(θ′, z)| ≤ M ∥θ − θ′∥,

and so an ϵ-cover of Θ is an Mϵ-cover of F in supremum norm. In particular,

    N(ϵ, F, ∥·∥∞) ≤ N(ϵ/M, Θ, ∥·∥).

Assume that Θ ⊂ {θ | ∥θ∥ ≤ b} for some finite b. Then Lemma 5.1.10 guarantees that
log N(ϵ, Θ, ∥·∥) ≤ d log(1 + 2b/ϵ) ≲ d log(b/ϵ), and so the classical covering number argument in
Example 5.1.9 gives

    P(sup_{θ∈Θ} |Pn ℓ(θ, Z) − P ℓ(θ, Z)| ≥ t) ≤ exp(−c nt²/(b²M²) + Cd log(M/t)),

where c, C are numerical constants. In particular, taking t² ≍ (M²b²d/n) log(n/δ) gives that

    sup_{θ∈Θ} |Pn ℓ(θ, Z) − P ℓ(θ, Z)| ≲ Mb √(d log(n/δ)) / √n

with probability at least 1 − δ. 3

5.2.3 Structural risk minimization and adaptivity


In general, for a given function class F, we can always decompose the excess risk into the approxi-
mation/estimation error decomposition. That is, let

L∗ = inf L(f ),
f

where the preceding infimum is taken across all (measurable) functions. Then we have

    L(f̂n) − L∗ = [ L(f̂n) − inf_{f∈F} L(f) ] + [ inf_{f∈F} L(f) − L∗ ],    (5.2.6)

where the first bracketed term is the estimation error and the second the approximation error.
There is often a tradeoff between these two, analogous to the bias/variance tradeoff in classical
statistics; if the approximation error is very small, then it is likely hard to guarantee that the esti-
mation error converges quickly to zero, while certainly a constant function will have low estimation
error, but may have substantial approximation error. With that in mind, we would like to develop
procedures that, rather than simply attaining good performance for the class F, are guaranteed
to trade-off in an appropriate way between the two types of error. This leads us to the idea of
structural risk minimization.
In this scenario, we assume we have a sequence of classes of functions, F1 , F2 , . . ., of increasing
complexity, meaning that F1 ⊂ F2 ⊂ . . .. For example, in a linear classification setting with
vectors x ∈ Rd , we might take a sequence of classes allowing increasing numbers of non-zeros in
the classification vector θ:
    F1 := {fθ(x) = θ⊤x such that ∥θ∥0 ≤ 1},  F2 := {fθ(x) = θ⊤x such that ∥θ∥0 ≤ 2},  . . . .

More broadly, let {Fk }k∈N be a (possibly infinite) increasing sequence of function classes. We
assume that for each Fk and each n ∈ N, there exists a constant Cn,k (δ) such that we have the
uniform generalization guarantee
    P( sup_{f∈Fk} |L̂n(f) − L(f)| ≥ Cn,k(δ) ) ≤ δ · 2^{−k}.


For example, by Corollary 5.2.6, if each Fk is finite we may take

    Cn,k(δ) = √( 2σ² (log|Fk| + log(1/δ) + k log 2) / n ).

(We will see in subsequent sections of the course how to obtain other more general guarantees.)
We consider the following structural risk minimization procedure. First, given the empirical
risk L̂n, we find the model index k̂ minimizing the penalized risk

    k̂ := argmin_{k∈N} inf_{f∈Fk} { L̂n(f) + Cn,k(δ) }.    (5.2.7a)

We then choose f̂ to minimize the empirical risk over the estimated “best” class F_k̂, that is, set

    f̂ := argmin_{f∈F_k̂} L̂n(f).    (5.2.7b)

With this procedure, we have the following theorem.

Theorem 5.2.12. Let fb be chosen according to the procedure (5.2.7a)–(5.2.7b). Then with proba-
bility at least 1 − δ, we have

    L(f̂) ≤ inf_{k∈N} inf_{f∈Fk} { L(f) + 2Cn,k(δ) }.

Proof First, we have by the assumed guarantee on Cn,k(δ) that

    P( ∃ k ∈ N such that sup_{f∈Fk} |L̂n(f) − L(f)| ≥ Cn,k(δ) )
        ≤ Σ_{k=1}^∞ P( sup_{f∈Fk} |L̂n(f) − L(f)| ≥ Cn,k(δ) ) ≤ Σ_{k=1}^∞ δ · 2^{−k} = δ.

On the event that sup_{f∈Fk} |L̂n(f) − L(f)| < Cn,k(δ) for all k, which occurs with probability at least
1 − δ, we have

    L(f̂) ≤ L̂n(f̂) + C_{n,k̂}(δ) = inf_{k∈N} inf_{f∈Fk} { L̂n(f) + Cn,k(δ) } ≤ inf_{k∈N} inf_{f∈Fk} { L(f) + 2Cn,k(δ) }

by our choice of f̂. This is the desired result.

We conclude with a final example, using our earlier floating point bound from Example 5.2.5,
coupled with Corollary 5.2.6 and Theorem 5.2.12.

Example 5.2.13 (Structural risk minimization with floating point classifiers): Consider
again our floating point example, and let the function class Fk consist of functions defined by
at most k double-precision floating point values, so that log|Fk| ≤ 45k. Then by taking

    Cn,k(δ) = √( (log(1/δ) + 65k log 2) / (2n) )


we have that |L̂n(f) − L(f)| ≤ Cn,k(δ) simultaneously for all f ∈ Fk and all k, with probability
at least 1 − δ. Then the structural risk minimization procedure (5.2.7) guarantees that

    L(f̂) ≤ inf_{k∈N} inf_{f∈Fk} { L(f) + √( (2 log(1/δ) + 91k) / n ) }.

Roughly, we trade between small risk L(f)—as the risk inf_{f∈Fk} L(f) must be decreasing in
k—and the estimation error penalty, which scales as √((k + log(1/δ))/n). 3
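The selection rule (5.2.7a)–(5.2.7b) is simple to implement when each class Fk is finite. The following minimal sketch (our own rendering; all names are ours) assumes losses in [0, 1], so that σ² = 1/4 in the penalty from Corollary 5.2.6, and that the empirical risks over each Fk and the log-cardinalities log|Fk| are available:

```python
import numpy as np

def penalty(n, k, delta, log_card_k, sigma2=0.25):
    """C_{n,k}(delta) = sqrt(2 sigma^2 (log|F_k| + log(1/delta) + k log 2)/n)."""
    return np.sqrt(2 * sigma2 * (log_card_k + np.log(1 / delta) + k * np.log(2)) / n)

def srm_select(empirical_risks, n, delta, log_cards):
    """Structural risk minimization (5.2.7): empirical_risks[k-1] holds the values
    of hat L_n(f) for f in F_k.  Returns the chosen (class index, function index)."""
    best_val, best = np.inf, (None, None)
    for k, risks in enumerate(empirical_risks, start=1):
        j = int(np.argmin(risks))                       # inner inf over f in F_k
        val = risks[j] + penalty(n, k, delta, log_cards[k - 1])
        if val < best_val:
            best_val, best = val, (k, j)
    return best
```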

5.3 M-estimators and estimation


In many problems in statistics and machine learning, we seek not just to have small loss on some
future data but to actually recover parameters of interest. For example, in a regression problem
where we model
y = ⟨x, θ⟩ + ε,
we often care about the actual values of θ. Less prosaically, in the latter part of the book we will
develop a number of fundamental limits and lower bounds; it is good to have algorithms and upper
bounds to demonstrate their tightness!
To that end, we here develop representative finite-sample results on the convergence of different
estimators. We focus on M-estimators, meaning those that arise from minimization of a loss function
ℓ(θ, z) convex in θ, where z is problem data. Exponential families, from Chapter 3, provide natural
examples here.
Example 5.3.1 (Exponential families and log loss): Let pθ (x) = exp(⟨θ, ϕ(x)⟩ − A(θ)) be
the density of an exponential family. Then ℓ(θ, x) := − log pθ (x) = −⟨θ, ϕ(x)⟩ + A(θ) is convex
and C ∞ over Θ := {θ | A(θ) < ∞}. 3

Example 5.3.2 (Logistic regression): In binary logistic regression (Example 3.4.2) with
labels y ∈ {±1}, we have data z = (x, y) ∈ Rd × {±1}, and the loss
 
    ℓ(θ, z) = − log pθ(y | x) = log(1 + exp(−y x⊤θ)),

which is C∞, has domain all of Rd, and has Lipschitz-continuous derivatives of all orders (though the
particular Lipschitz constants depend on x). 3
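For concreteness, here is a short numerical sketch of this loss (code and naming ours), using the gradient and Hessian formulas we record when revisiting this example in Section 5.3.1:

```python
import numpy as np

def logistic_loss_grad_hess(theta, x, y):
    """ell(theta, (x, y)) = log(1 + exp(-y <x, theta>)) for y in {-1, +1}."""
    m = y * x.dot(theta)                    # margin y <x, theta>
    loss = np.logaddexp(0.0, -m)            # log(1 + e^{-m}), numerically stable
    p = 0.5 * (1.0 - np.tanh(m / 2.0))      # = 1/(1 + e^m) = 1 - p_theta(y | x)
    grad = -p * y * x                       # = -y x / (1 + e^{y x^T theta})
    hess = p * (1.0 - p) * np.outer(x, x)   # = p_theta(y|x)(1 - p_theta(y|x)) x x^T
    return loss, grad, hess
```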

Regardless, we will consider general rates of convergence and estimation error for estimators
minimizing the empirical loss
    Ln(θ) := Pn ℓ(θ, Z) = (1/n) Σ_{i=1}^n ℓ(θ, Zi),

which approximates the population loss

    L(θ) := EP[ℓ(θ, Z)],

where Zi ∼iid P and Pn = (1/n) Σ_{i=1}^n 1_{Zi} is the usual empirical distribution. Thus, for a closed convex
set Θ ⊂ Rd, we will study the M-estimator

    θ̂n := argmin_{θ∈Θ} Ln(θ),    (5.3.1)


providing prototypical arguments for its convergence. Often, we will take Θ = Rd , though this will
not be essential.
Based on the results in the preceding sections on uniform convergence, one natural idea is to
use uniform convergence: if we can argue that

    sup_{θ∈Θ} |Ln(θ) − L(θ)| → 0,

then so long as the minimizer θ⋆ of L is unique and L(θ) − L(θ⋆ ) grows with the distance ∥θ − θ⋆ ∥,
we necessarily have θbn → θ⋆ . Unfortunately, this naive approach typically fails to achieve the
correct convergence rates, let alone the correct dependence on problem parameters. We therefore
take another approach.
JCD Comment: Will need some figures here / illustrations

Recall that a twice differentiable function L is convex if and only if ∇2 L(θ) ⪰ 0 for all θ ∈ dom L.
We thus expect that L should have some quadratic growth around its minimizer θ⋆ , meaning that
in a neighborhood of θ⋆ , we have
    L(θ) ≥ L(θ⋆) + (λ/2) ∥θ − θ⋆∥2²
for θ near enough θ⋆ . In such a situation, because the sampled Ln is also convex and approximates
L, we then expect that for parameters θ far enough from θ⋆ that the growth of L(θ) above L(θ⋆ )
dominates the noise inherent in the sampling, we necessarily have Ln (θ) > Ln (θ⋆ ). Because the
empirical minimizer θbn necessarily satisfies Ln (θbn ) ≤ Ln (θ⋆ ), we would never choose such a distant
parameter, thus implying a convergence rate. To make this type of argument rigorous requires a
bit of convex analysis and sampling theory; luckily, we are by now well-equipped to address this.

5.3.1 Standard conditions and convex optimization


To provide relatively clean results, we will consider a collection of loss functions to simplify analysis.
As we will see, the assumed conditions are not too onerous, as many families of losses satisfy
them. We can relax them using some of the more sophisticated concentration inequalities we have
developed. We therefore make the following standing assumption.

Assumption A.5.1 (Standard conditions). For each z ∈ Z, the losses ℓ(θ, z) are convex in θ.
There are constants M0 , M1 , M2 < ∞ such that for each z ∈ Z,

(i) ∥∇ℓ(θ⋆, z)∥2 ≤ M0,

(ii) ∥∇²ℓ(θ⋆, z)∥op ≤ M1, and

(iii) the Hessian ∇²ℓ(θ, z) is M2-Lipschitz continuous in a neighborhood of radius r > 0 around
θ⋆, meaning ∥∇²ℓ(θ0, z) − ∇²ℓ(θ1, z)∥op ≤ M2 ∥θ0 − θ1∥2 whenever ∥θi − θ⋆∥2 ≤ r.

Additionally, the minimizer θ⋆ = argminθ L(θ) exists and for a λ > 0 satisfies

∇2 L(θ⋆ ) ⪰ λI.


The “standard conditions” in Assumption A.5.1 are not so onerous. As we see when we spe-
cialize our coming results to exponential family models in Section 5.3.4, Assumption A.5.1 holds
essentially as soon as the family is minimal and ϕ(x) is bounded. The existence of minimizers can
be somewhat more subtle to guarantee than the smoothness conditions (i)–(iii), though these are
typically straightforward. (For more on the existence of minimizers, see Exercise 5.10.)
To quickly highlight the conditions, we revisit binary logistic and robust regression.

Example (Example 5.3.2 continued): For logistic regression with labels y ∈ {±1}, we have

    ∇ℓ(θ, (x, y)) = −(1/(1 + e^{yx⊤θ})) yx  and  ∇²ℓ(θ, (x, y)) = pθ(y | x)(1 − pθ(y | x)) xx⊤.

Then Assumptions (i)–(iii) hold so long as sup_{x∈X} ∥x∥2 < ∞, with M0 = sup_{x∈X} ∥x∥2 and
M1 = (1/4) M0². We revisit the existence of minimizers in the sequel, noting that because 0 <
pθ(y | x)(1 − pθ(y | x)) ≤ 1/4 for any θ, x, y, if a minimizer exists then it is necessarily unique
as soon as E[XX⊤] ≻ 0. 3

Example 5.3.3 (Robust regression): In robust regression, we wish to best approximate


responses y ∈ R via linear functions x⊤ θ, but because of outliers, do not use the squared loss.
Thus, for a smooth symmetric convex function h with bounded derivatives, we take

ℓ(θ, (x, y)) = h(⟨x, θ⟩ − y),

so that
∇ℓ(θ, (x, y)) = h′ (⟨x, θ⟩ − y)x and ∇2 ℓ(θ, (x, y)) = h′′ (⟨x, θ⟩ − y)xx⊤ .
A prototypical example is h(t) = log(1 + e^t) + log(1 + e^{−t}), which satisfies h′(t) = (e^t − 1)/(e^t + 1) ∈
[−1, 1] and h″(t) = 2e^t/(e^t + 1)² ∈ (0, 1/2]. So long as the covariates x ∈ X have finite radius
rad2(X) := sup_{x∈X} ∥x∥2, we obtain the bounds

    M0 ≤ rad2(X),  M1 ≲ rad2²(X),  and  M2 ≲ rad2³(X)
for Assumption A.5.1, parts (i)–(iii). In general, if h is symmetric with h′′ (0) > 0, then
minimizers exist whenever Y is non-pathological and E[XX ⊤ ] ≻ 0. Exercise 5.11 asks you to
prove this last claim on existence of minimizers. 3
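A quick numerical spot check (ours) of the closed-form derivatives of this prototypical h, using the identities h′(t) = tanh(t/2) and h″(t) = (1 − tanh²(t/2))/2 = 2e^t/(e^t + 1)²:

```python
import numpy as np

def robust_h(t):
    """h(t) = log(1 + e^t) + log(1 + e^{-t}) with its first two derivatives."""
    h = np.logaddexp(0.0, t) + np.logaddexp(0.0, -t)
    h1 = np.tanh(t / 2.0)            # h'(t) = (e^t - 1)/(e^t + 1), in [-1, 1]
    h2 = 0.5 * (1.0 - h1 ** 2)       # h''(t) = 2 e^t/(e^t + 1)^2, in (0, 1/2]
    return h, h1, h2

for t in (-2.0, 0.0, 1.5):           # compare h1 to a centered finite difference
    h, h1, h2 = robust_h(t)
    num = (robust_h(t + 1e-5)[0] - robust_h(t - 1e-5)[0]) / 2e-5
    print(t, h1, num)
```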

5.3.2 Some growth properties of convex functions


As we discuss above, we will roughly proceed in our analysis by showing that the growth of the
loss function dominates the noise inherent in sampling. To do so, we will rely on certain growth
properties of convex functions. We collect them here, as they provide the fundamental building
block for convergence analysis.
First, we show that for any convex function h, if there exists a “shell” S = {θ | ∥θ − θ0 ∥2 = r}
around some point θ0 for which h(θ) > h(θ0 ) for all θ ∈ S, then necessarily the minimizer θb =
argminθ h(θ) satisfies ∥θb − θ0 ∥2 < r.
JCD Comment: Figure(s)

Lemma 5.3.4. Let h be convex, θ0 ∈ dom h, and v an arbitrary vector, and set θ1 := θ0 + v. Then for all t ≥ 1,

    h(θ0 + tv) − h(θ0) ≥ t (h(θ1) − h(θ0)).


Proof Let θt = θ0 + tv for t ≥ 1. Then for t ≥ 1, we can write θ1 = (1/t)θt + (1 − 1/t)θ0, and so

    h(θ1) = h( (1 − 1/t)θ0 + (1/t)θt ) ≤ (1 − 1/t) h(θ0) + (1/t) h(θt).

Rearranging yields

    (1 − 1/t)(h(θ1) − h(θ0)) ≤ (1/t)(h(θt) − h(θ1)),

and multiplying through by t and rearranging implies the desired result.

Extending this to shells, we have the following result.


Lemma 5.3.5. Let h be convex and θ0 ∈ dom h, and assume that for some ϵ > 0 and δ > 0 we have
h(θ0 + v) ≥ h(θ0) + δ for all ∥v∥2 = ϵ. Then for all θ for which ∥θ − θ0∥2 ≥ ϵ,

    h(θ) − h(θ0) ≥ (δ/ϵ) ∥θ − θ0∥2.

Proof For any θ with ∥θ − θ0∥2 ≥ ϵ, we can write θ = θ0 + tv for v = ϵ (θ − θ0)/∥θ − θ0∥2 and t = ∥θ − θ0∥2/ϵ.
Apply Lemma 5.3.4 with the substitution h(θ1) − h(θ0) ≥ δ.

Now we connect these results to the growth of suitably smooth convex functions. Here, we wish
to argue that the minimizer of a convex function h is not too far from a benchmark point θ0 at
which h has strong upward curvature and small gradient.
JCD Comment: Figure.

Lemma 5.3.6. Let h be convex and λ > 0, γ ≥ 0, and ϵ ≥ 2γ/λ. Assume that for some θ0, we
have ∥∇h(θ0)∥2 ≤ γ and ∇²h(θ) ⪰ λI for all θ satisfying ∥θ − θ0∥2 ≤ ϵ. Then the minimizer
θ̂ = argmin_θ h(θ) exists and satisfies

    ∥θ̂ − θ0∥2 ≤ 2γ/λ.
Proof By Taylor's theorem, for any θ we have

    h(θ) = h(θ0) + ⟨∇h(θ0), θ − θ0⟩ + (1/2)(θ − θ0)⊤ ∇²h(θ̃)(θ − θ0)

for a point θ̃ on the line between θ and θ0. Now, take θ such that ∥θ − θ0∥2 ≤ ϵ. Then by
assumption ∇²h(θ̃) ⪰ λI, and so we have

    h(θ) ≥ h(θ0) + ⟨∇h(θ0), θ − θ0⟩ + (λ/2)∥θ − θ0∥2²
         ≥ h(θ0) − γ ∥θ − θ0∥2 + (λ/2)∥θ − θ0∥2²

by the assumption that ∥∇h(θ0)∥2 ≤ γ and the Cauchy-Schwarz inequality.
Fix t ≥ 0. If we can show that h(θ) > h(θ0) for all θ satisfying ∥θ − θ0∥2 = t, then Lemma 5.3.5
implies that h(θ) > h(θ0) whenever ∥θ − θ0∥2 ≥ t, so that necessarily ∥θ̂ − θ0∥2 < t. Returning to
the previous display and letting t = ∥θ − θ0∥2, note that

    h(θ) ≥ h(θ0) − γt + (λ/2)t²,


and −γt + (λ/2)t² = t((λ/2)t − γ) > 0 whenever t > 2γ/λ. As by assumption ∇²h(θ) ⪰ λI whenever
∥θ − θ0∥2 ≤ ϵ for some ϵ ≥ 2γ/λ, this implies the result.

Lemma 5.3.7. Let h be convex and assume that ∇²h is M2-Lipschitz (part (iii) of Assumption A.5.1). Let λ > 0 be large enough and γ > 0 be small enough that γ < λ²/(8M2). Then if
both ∇²h(θ0) ⪰ λI and ∥∇h(θ0)∥2 ≤ γ, the minimizer θ̂ = argmin_θ h(θ) exists and satisfies

    ∥θ̂ − θ0∥2 ≤ 4γ/λ.

Proof By Lemma 5.3.6 (applied with λ/2 in place of λ), it is enough to show that ∇²h(θ) ⪰ (λ/2)I for all θ with ∥θ − θ0∥2 ≤ 4γ/λ. For
this, we use the M2-Lipschitz continuity of ∇²h to obtain that for any θ with ∥θ − θ0∥2 = t,

    ∇²h(θ) ⪰ ∇²h(θ0) − M2 ∥θ − θ0∥2 I ⪰ (λ − M2 t)I.

So if t ≤ λ/(2M2) we have ∇²h(θ) ⪰ (λ/2)I. Because 4γ/λ ≤ λ/(2M2) by assumption, we have ∇²h(θ) ⪰ (λ/2)I
whenever ∥θ − θ0∥2 ≤ 4γ/λ, yielding the result.

5.3.3 Convergence analysis for convex M-estimators


By leveraging Lemma 5.3.7, to show a convergence rate guarantee for the empirical minimizer θbn ,
it is evidently sufficient to demonstrate two (related) conditions: that for some sequence γn → 0,
we have
    ∥∇Ln(θ⋆)∥2 ≤ γn    (5.3.2a)

with high probability, and that for some λ > 0, we have

    ∇²Ln(θ⋆) ⪰ λI    (5.3.2b)

with high probability. Happily, the convergence guarantees we develop in Chapter 4 provide pre-
cisely the tools to do this.

Theorem 5.3.8. Let Assumption A.5.1 hold for the M-estimation problem (5.3.1). Let δ ∈ (0, 1/2),
and define

    γn(δ) := (M0/√n) (1 + √(2 log(1/δ)))  and  ϵn(δ) := max{ 2∥∇²L(θ⋆)∥op √((1/n) log(2d/δ)), (4M1/3n) log(2d/δ) }.

Then we have both ∥∇Ln(θ⋆)∥2 ≤ γn(δ) and ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≤ ϵn(δ) with probability at
least 1 − 2δ. So long as ϵn(δ) ≤ λ/4 and γn(δ) ≤ 9λ²/(128M2), then with the same probability,

    ∥θ̂n − θ⋆∥2 ≤ (16/3) · γn(δ)/λ.


We defer the proof of the theorem to Section 5.3.5, instead providing commentary and a few
examples of its application. Ignoring the numerical constants, the theorem roughly states the
following: once n is large enough that

    n ≳ (M2² M0² / λ⁴) log(1/δ)  and  n ≳ (∥∇²L(θ⋆)∥op² / λ²) log(d/δ),    (5.3.3)

with probability at least 1 − δ we have

    ∥θ̂n − θ⋆∥2 ≲ (M0/(λ√n)) √(log(1/δ)).    (5.3.4)
These finite sample results are, at least for large n, order optimal, as we will develop in the coming
sections on fundamental limits. Nonetheless, the conditions (5.3.3) are stronger than necessary,
typically requiring that n be quite large. In the exercises, we explore a class of quasi-self-concordant
losses, where the second derivative controls the third derivative, allowing more direct application
of Lemma 5.3.6, which allows reducing this sample size requirement. (See Exercises 5.5 and 5.6).
Example 5.3.9 (Logistic regression, Example 5.3.2 continued): Recalling the logistic loss
ℓ(θ, (x, y)) = log(1 + e^{−y⟨x,θ⟩}) for y ∈ {±1} and x ∈ Rd, assume the domain X consists of
vectors x with ∥x∥2 ≤ √d. For example, if X ⊂ [−1, 1]^d, this holds. In this case, M0 = √d,
while M1 ≤ (1/4)d and M2 ≲ d^{3/2}. Assuming that the population Hessian ∇²L(θ⋆) = E[pθ⋆(Y |
X)(1 − pθ⋆(Y | X)) XX⊤] has minimal eigenvalue λmin(∇²L(θ⋆)) ≳ 1, then the conclusions of
Theorem 5.3.8 apply as soon as n ≳ d⁴ log(1/δ). 3
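A small Monte Carlo sketch can illustrate the 1/√n rate the theorem predicts. The design below (Rademacher covariates scaled so ∥x∥2 = 1, plain gradient descent as the solver, and all the particular constants) is our own illustrative choice, not part of the theorem:

```python
import numpy as np

def logistic_mle_error(n, d, theta_star, rng, steps=300, lr=1.0):
    """Sample n pairs from the logistic model, minimize L_n by gradient descent,
    and return ||theta_hat - theta_star||_2 (expected to scale as 1/sqrt(n))."""
    X = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)   # ||x||_2 = 1
    p = 1.0 / (1.0 + np.exp(-X @ theta_star))
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    theta = np.zeros(d)
    for _ in range(steps):
        # grad L_n(theta) = -(1/n) sum_i y_i x_i / (1 + exp(y_i <x_i, theta>))
        theta -= lr * (-(y / (1.0 + np.exp(y * (X @ theta)))) @ X / n)
    return np.linalg.norm(theta - theta_star)

rng = np.random.default_rng(0)
theta_star = np.ones(5) / np.sqrt(5)
for n in [100, 400, 1600, 6400]:     # quadrupling n should roughly halve the error
    print(n, logistic_mle_error(n, 5, theta_star, rng))
```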
When n is large enough, the guarantee (5.3.4) allows us to also make the heuristic asymptotic
expansions for the exponential family models in Section 3.2.1 hold in finite samples. Let the
conclusions of Theorem 5.3.8 hold, so that ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≤ ϵn(δ) and so on. Then once
we know that θ̂n exists, by a Taylor expansion we can write

    0 = ∇Ln(θ̂n) = ∇Ln(θ⋆) + (∇²Ln(θ⋆) + En)(θ̂n − θ⋆) = ∇Ln(θ⋆) + (∇²L(θ⋆) + En′)(θ̂n − θ⋆),

where En is an error matrix satisfying ∥En∥op ≤ M2 ∥θ̂n − θ⋆∥2 and En′ = En + ∇²Ln(θ⋆) − ∇²L(θ⋆)
satisfies ∥En′∥op ≤ ∥En∥op + ϵn(δ) by the triangle inequality and Theorem 5.3.8. Using the infinite
series expansion of the inverse

    (A + E)^{−1} = A^{−1} + Σ_{i=1}^∞ (−1)^i (A^{−1}E)^i A^{−1},

valid for A ≻ 0 whenever ∥E∥op < λmin(A) (see Exercise 5.4), we therefore have

    θ̂n − θ⋆ = −(∇²L(θ⋆) + En′)^{−1} ∇Ln(θ⋆) = −∇²L(θ⋆)^{−1} ∇Ln(θ⋆) + Rn,

where the remainder vector Rn satisfies

    ∥Rn∥2 ≲ ∥∇²L(θ⋆)^{−1} En′ ∇²L(θ⋆)^{−1}∥op ∥∇Ln(θ⋆)∥2
          ≲ (1/λ²) ( M2 M0 √(log(1/δ))/(λ√n) + ϵn(δ) ) ∥∇Ln(θ⋆)∥2
          ≲ (M2 M0²/λ³) · log(1/δ)/n + (M0 ∥∇²L(θ⋆)∥op/λ²) · √(log(1/δ) log(d/δ))/n

with probability at least 1 − δ. We summarize this in the following corollary.


Corollary 5.3.10. Let the conditions of Theorem 5.3.8 hold. Then there exists a problem-dependent
constant C such that the following holds: for any δ > 0, for any n ≥ C log(1/δ), with probability at
least 1 − δ

    θ̂n − θ⋆ = −∇²L(θ⋆)^{−1} ∇Ln(θ⋆) + Rn,

where the remainder Rn satisfies ∥Rn∥2 ≤ C · (1/n) log(d/δ). The constant C may be taken to be continuous
in all the problem parameters of Assumption A.5.1.

Corollary 5.3.10 highlights the two salient terms governing error in estimation problems: the
curvature of the loss near the optimum, as ∇2 L(θ⋆ ) contributes, and the variance in the gradients
∇Ln(θ⋆) = (1/n) Σ_{i=1}^n ∇ℓ(θ⋆, Zi). When the Hessian term ∇²L(θ⋆) is “large,” meaning that ∇²L(θ⋆) ⪰
λI for some large value λ > 0, then estimation is easier: the curvature of the loss helps to identify
θ⋆. Conversely, when the variance Var(∇ℓ(θ⋆, Z)) = E[∥∇ℓ(θ⋆, Z)∥2²] is large, then estimation is
more challenging. As a final remark, let us imagine that the remainder term Rn in the corollary,
in addition to being small with high probability, satisfies E[∥Rn∥2²] ≤ C/n², where C is a problem-dependent
constant. Let Gn = −∇²L(θ⋆)^{−1} ∇Ln(θ⋆) be the leading term in the expansion, which
satisfies

    E[∥Gn∥2²] = (1/n) Var(∇²L(θ⋆)^{−1} ∇ℓ(θ⋆, Z)) = tr(∇²L(θ⋆)^{−1} Cov(∇ℓ(θ⋆, Z)) ∇²L(θ⋆)^{−1}) / n.

Then because ∥Gn + Rn∥2² ≤ ∥Gn∥2² + ∥Rn∥2² + 2∥Gn∥2 ∥Rn∥2, we have the heuristic

    E[∥θ̂n − θ⋆∥2²] = E[∥Gn + Rn∥2²] = E[∥Gn∥2²] + E[∥Rn∥2²] ± 2E[∥Gn∥2 ∥Rn∥2]
        =(⋆) tr(∇²L(θ⋆)^{−1} Cov(∇ℓ(θ⋆, Z)) ∇²L(θ⋆)^{−1})/n ± C/n^{3/2},    (5.3.5)

where C is a problem-dependent constant and the step (⋆) is heuristic. See Exercise 5.7 for one
approach to make this step rigorous.
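The expansion is especially transparent for the quadratic loss ℓ(θ, z) = (1/2)∥θ − z∥2², where ∇²L ≡ I, θ̂n = Pn Z exactly, and the remainder Rn vanishes. The following minimal numeric check (our own degenerate example) confirms the identity in that case:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 1000
# ell(theta, z) = 0.5 ||theta - z||^2: grad L_n(theta*) = theta* - mean(Z),
# Hessian = I, and theta_hat = mean(Z), so R_n = 0 identically.
theta_star = np.zeros(d)                 # population minimizer when Z ~ N(0, I)
Z = rng.normal(size=(n, d))
theta_hat = Z.mean(axis=0)               # argmin of L_n
leading = -np.linalg.inv(np.eye(d)) @ (theta_star - Z.mean(axis=0))
print(np.allclose(theta_hat - theta_star, leading))   # True: expansion is exact
```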

5.3.4 Consequences for exponential families and generalized linear models


Working through a few example applications of Corollary 5.3.10 with the exponential family and
generalized linear models of Chapter 3 can help to make the results and connections clearer. Recall
that for an exponential family model with loss ℓ(θ; x) = − log pθ (x) = −⟨θ, ϕ(x)⟩ + A(θ), we
heuristically derived in expression (3.2.4) that if the data were i.i.d. from the exponential family
model Pθ⋆ , then
    θ̂n − θ⋆ ∼ N(0, n^{−1} · ∇²A(θ⋆)^{−1})  (approximately).


Corollary 5.3.10 presents one approach to make this rigorous. Assuming the sufficient statistics ϕ
are bounded, we have ∇²L(θ⋆) = ∇²A(θ⋆), and Covθ⋆(∇Ln(θ⋆)) = (1/n) Covθ⋆(ϕ(X)) = (1/n) ∇²A(θ⋆).
So

    θ̂n − θ⋆ = −∇²A(θ⋆)^{−1} (1/n) Σ_{i=1}^n (ϕ(Xi) − ∇A(θ⋆)) + Rn,

where the remainder satisfies ∥Rn∥2 ≤ C (1/n) log(d/δ).
To obtain finite sample expected bounds requires a bit of tedium because of small probability
events (e.g., that the sampled Hessian matrix ∇2 Ln (θ⋆ ) fails to be invertible). One simple device


is to consider the estimator θbn only on some “good” event En that occurs with high probability, for
example, that the remainder Rn is small. The next corollary provides a prototypical result under the
assumption that Xi ∼iid Pθ⋆ for an exponential family model with bounded data sup_{x∈X} ∥ϕ(x)∥2 < ∞
and positive definite Hessian ∇²A(θ⋆) ≻ 0.

Corollary 5.3.11. Under the preceding conditions on the exponential family model Pθ⋆, there exists
a problem dependent constant C < ∞ such that the following holds: for any k ≥ 1, there are events
En with probability P(En) ≥ 1 − n^{−k} and

    Eθ⋆[ ∥θ̂n − θ⋆∥2² · 1{En} ] ≤ (1/n) tr(∇²A(θ⋆)^{−1}) + Ck log n / n^{3/2}.
The constant C may be taken continuous in θ⋆ .

Recalling the equality (3.3.2), we see that the Fisher information ∇2 A(θ) appears in a fundamental
way for the exponential families. Proposition 3.5.1 shows that this quantity is fundamental, at
least for testing; here it provides an upper bound on the convergence of the maximum likelihood
estimator. Exercise 5.8 extends Corollary 5.3.11 to an equality to within lower order terms.
Proof Let δ = δn > 0 to be chosen and define the event En to be that ∥Rn∥2 ≤ C(1/n) log(d/δ), which
occurs with probability at least 1 − δ. Let

    Gn = −∇²L(θ⋆)^{−1} ∇Ln(θ⋆) = −∇²A(θ⋆)^{−1} Pn(ϕ(X) − ∇A(θ⋆))

be the mean-zero gradient term. Then θ̂n − θ⋆ = Gn + Rn, and ∥Gn + Rn∥2² ≤ ∥Gn∥2² + ∥Rn∥2² +
2∥Gn∥2 ∥Rn∥2. On the event En we have ∥Rn∥2 ≤ C(1/n) log(d/δ), and so

    E[ ∥θ̂n − θ⋆∥2² 1{En} ] ≤ E[∥Gn∥2²] + (C²/n²) log²(d/δ) + (2C/n) log(d/δ) E[∥Gn∥2].

Now note that E[∥Gn∥2²] = (1/n) tr(∇²A(θ⋆)^{−1}), so that E[∥Gn∥2] ≤ √(tr(∇²A(θ⋆)^{−1})/n), and set δ = 1/n^k.

These ideas also extend to generalized linear models, such as linear, logistic, or Poisson regression
(recall Chapter 3.4). For the abstract generalized linear model of predicting a target y ∈ Y from
covariates x ∈ X , we have

ℓ(θ, (x, y)) = − log pθ (y | x) = −ϕ(x, y)⊤ θ + A(θ | x).

Because the log partition function is C∞, the smoothness conditions in Assumption A.5.1 then reduce to the
boundedness

    rad2({ϕ(x, y) | x ∈ X, y ∈ Y}) := sup_{x∈X, y∈Y} ∥ϕ(x, y)∥2 < ∞.

Assuming that a minimizer θ⋆ = argminθ L(θ) exists, the (local) strong convexity condition that
∇2 L(θ⋆ ) ≻ 0 then becomes that E[∇2 A(θ | X)] = E[Covθ (ϕ(X, Y ) | X)] ≻ 0. Exercise 5.10, part (c)
gives general sufficient conditions for the existence of minimizers in GLMs.
For logistic regression (Example 3.4.2), these conditions correspond to a bound on the covariate
data x, that E[XX ⊤ ] ≻ 0, and that for each X, the label Y is non-deterministic. For Poisson

regression (Example 3.4.4), we have ℓ(θ, (x, y)) = −yx⊤θ + e^{θ⊤x}. When the count data Y ∈ N can
be unbounded, Assumption A.5.1.(i) may fail, because yx may be unbounded. If we model a data-
generating process for which X and Y are both bounded using Poisson regression, however, then


the smoothness conditions in Assumption A.5.1 hold. (Again, see Exercise 5.10 for the existence
of solutions.)
Regardless, by an argument completely parallel to that for Corollary 5.3.10, we can provide
convergence rates for generalized linear model estimators. Here, we avoid the assumption of model
fidelity, instead assuming that θ⋆ = argminθ L(θ) exists and ∇2 L(θ⋆ ) = E[∇2 A(θ⋆ | X)] ≻ 0, so
that θ⋆ is unique.

Corollary 5.3.12. Let the preceding conditions hold and pθ be a generalized linear model. Then
there exists a problem constant C < ∞ such that the following holds: for any δ ∈ (0, 1) and for all
n ≥ C log(1/δ), with probability at least 1 − δ

    θ̂n − θ⋆ = −E[∇²A(θ⋆ | X)]^{−1} Pn(ϕ(X, Y) − ∇A(θ⋆ | X)) + Rn,

where the remainder satisfies ∥Rn∥2 ≤ (C/n) log(d/δ).

When the generalized linear model Pθ is correct, so that X ∼ P marginally and Y | X ∼ Pθ (· | X),
then Cov(ϕ(X, Y ) | X) = ∇2 A(θ⋆ | X), and so in this case (again, for a sequence of events En with
probability at least 1 − 1/nk ), we have
    Eθ⋆[ ∥θ̂n − θ⋆∥2² 1{En} ] ≤ tr(E[∇²A(θ⋆ | X)]^{−1})/n + Ck log n/n^{3/2}.
Note that this quantity is the trace of the inverse Fisher information (3.3.2) in the generalized
linear model: the “larger” the information, the better estimation accuracy we can guarantee.

5.3.5 Proof of Theorem 5.3.8


The two key steps in the proof of the theorem are lemmas providing the guarantees (5.3.2).

Lemma 5.3.13. Let Assumption A.5.1 hold. Then for any δ ∈ (0, 1), with probability at least 1 − δ
    ∥∇Ln(θ⋆)∥2 ≤ (M0/√n) (1 + √(2 log(1/δ))).

Proof The function z_1^n ↦ ∥∇Ln(θ)∥2 satisfies bounded differences: for any two empirical samples
Pn, Pn′ differing in only observation i,

    | ∥Pn ∇ℓ(θ, Z)∥2 − ∥Pn′ ∇ℓ(θ, Z)∥2 | ≤ ∥Pn ∇ℓ(θ, Z) − Pn′ ∇ℓ(θ, Z)∥2
        = (1/n) ∥∇ℓ(θ, Zi) − ∇ℓ(θ, Zi′)∥2 ≤ 2M0/n

by Assumption A.5.1.(i). Because E[∥∇Ln(θ⋆)∥2] ≤ √(E[∥∇Ln(θ⋆)∥2²]) ≤ M0/√n, Proposition 4.2.5
gives that

    P( ∥∇Ln(θ⋆)∥2 ≥ M0/√n + t ) ≤ exp( −nt²/(2M0²) )

for all t ≥ 0. Solving exp(−nt²/(2M0²)) = δ for t yields the lemma.


Lemma 5.3.14. Let Assumption A.5.1 hold. Then for any δ ∈ (0, 1), with probability at least 1 − δ

    ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≤ max{ 2∥∇²L(θ⋆)∥op √((1/n) log(2d/δ)), (4M1/3n) log(2d/δ) }.

Proof Because ∇²Ln(θ) = (1/n) Σ_{i=1}^n ∇²ℓ(θ, Zi) and ∥∇²ℓ(θ, z)∥op ≤ M1 by Assumption A.5.1.(ii),
Theorem 4.3.6 implies that

    P( ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≥ t ) ≤ 2d exp( − min{ nt²/(4∥∇²L(θ⋆)∥op²), 3nt/(4M1) } ).

Setting t = max{ 2∥∇²L(θ⋆)∥op √((1/n) log(2d/δ)), (4M1/3n) log(2d/δ) } gives the lemma.

Let ϵn(δ) be the bound on the right side of Lemma 5.3.14. Then with probability at least 1 − δ,

    ∇²Ln(θ⋆) ⪰ ∇²L(θ⋆) − ϵn(δ)I ⪰ (3λ/4) I

by the assumption that ϵn(δ) ≤ λ/4 and ∇²L(θ⋆) ⪰ λI; we therefore have the first condition
of Lemma 5.3.7 where 3λ/4 replaces λ. Lemma 5.3.13 gives that with probability at least 1 − δ,
∥∇Ln(θ⋆)∥2 ≤ γn(δ). Now use the assumption that γn(δ) ≤ 9λ²/(128M2) = (3λ/4)²/(8M2), so that
Lemma 5.3.7 (applied with 3λ/4 in place of λ) implies

    ∥θ̂n − θ⋆∥2 ≤ 4γn(δ)/(3λ/4) = (16/3) · γn(δ)/λ,

which proves the theorem.

5.4 Exercises
JCD Comment: Exercise ideas around this: We could try to do things with moment
bounds. Like we’d use something like the Marcinkiewicz bounds, and then some moment
bounds on matrices from different exercises, and we could say something.
Probably also write some moment bound guarantees for matrices with operator norms
would be really neat.
Also, exercises that handle dimension scaling could be fun, along with (associated) con-
vergence rates.
Exercise 5.1: In this question, we show how to use Bernstein-type (sub-exponential) inequal-
ities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequal-
ity (4.1.6)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and
Var(Xi ) ≤ σ 2 , then
    max{ P( (1/n) Σ_{i=1}^n Xi ≥ E[X] + t ), P( (1/n) Σ_{i=1}^n Xi ≤ E[X] − t ) }
        ≤ exp( −(1/2) min{ (5/6) nt²/σ², nt/(2b) } ).

We consider minimization of loss functions ℓ over finite function classes F with ℓ ∈ [0, 1], so that if
L(f ) = E[ℓ(f, Z)] then |ℓ(f, Z) − L(f )| ≤ 1. Throughout this question, we let

    L⋆ = min_{f∈F} L(f)  and  f⋆ ∈ argmin_{f∈F} L(f).


We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.

(a) Argue that for any f ∈ F

    P( L̂(f) ≥ L(f) + t ) ∨ P( L̂(f) ≤ L(f) − t ) ≤ exp( −(1/2) min{ (5/6) nt²/(L(f)(1 − L(f))), nt/2 } ).

(b) Define the set of “bad” prediction functions F_ϵ^bad := {f ∈ F : L(f) ≥ L⋆ + ϵ}. Show that for
any fixed ϵ ≥ 0 and any f ∈ F_{2ϵ}^bad, we have

    P( L̂(f) ≤ L⋆ + ϵ ) ≤ exp( −(1/2) min{ (5/6) nϵ²/(L⋆(1 − L⋆) + ϵ(1 − ϵ)), nϵ/2 } ).

(c) Let f̂n ∈ argmin_{f∈F} L̂(f) denote the empirical minimizer over the class F. Argue that it is
likely to have good performance, that is, for all ϵ ≥ 0 we have

    P( L(f̂n) ≥ L(f⋆) + 2ϵ ) ≤ card(F) · exp( −(1/2) min{ (5/6) nϵ²/(L⋆(1 − L⋆) + ϵ(1 − ϵ)), nϵ/2 } ).

(d) Using the result of part (c), argue that with probability at least 1 − δ,

    L(f̂n) ≤ L(f⋆) + 4 log(|F|/δ)/n + √(12/5) · √( L⋆(1 − L⋆) log(|F|/δ) ) / √n.

Why is this better than an inequality based purely on the boundedness of the loss ℓ, such as
Theorem 5.2.4 or Corollary 5.2.6? What happens when there is a perfect risk minimizer f ⋆ ?

Exercise 5.2: Prove Lemma 5.1.8.


Exercise 5.3: Consider a binary classification problem with logistic loss ℓ(θ; (x, y)) = log(1 +
exp(−yθ⊤x)), where θ ∈ Θ := {θ ∈ Rd | ∥θ∥1 ≤ r} and y ∈ {±1}. Assume additionally that the
space X ⊂ {x ∈ Rd | ∥x∥∞ ≤ b}. Define the empirical and population risks L̂n(θ) := Pn ℓ(θ; (X, Y))
and L(θ) := P ℓ(θ; (X, Y)), and let θ̂n = argmin_{θ∈Θ} L̂n(θ). Show that with probability at least 1 − δ
over (Xi, Yi) ∼iid P,

    L(θ̂n) ≤ inf_{θ∈Θ} L(θ) + C rb √(log(d/δ)) / √n,

where C < ∞ is a numerical constant (you need not specify this).
Exercise 5.4: Let A ≻ 0 be a positive definite matrix, and let E be Hermitian and satisfy
∥E∥op < λmin(A). Define Sk := A^{−1} + Σ_{i=1}^k (−1)^i (A^{−1}E)^i A^{−1}.

(a) Show that for any k ∈ N,

    (A + E) Sk = I + (−1)^k (EA^{−1})^{k+1}.

(b) Argue that S∞ := lim_k Sk exists and (A + E)^{−1} = S∞ = A^{−1} + Σ_{i=1}^∞ (−1)^i (A^{−1}E)^i A^{−1}.


(c) Let γ = ∥E∥op/λmin(A) < 1. Show that

    ∥(A + E)^{−1} − Sk∥op ≤ γ^{k+1} / (λmin(A)(1 − γ)).
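A numerical sanity check of the series (not a proof) may help intuition; the sketch below, with a symmetric E chosen so that ∥E∥op < λmin(A), accumulates the terms (−1)^i (A^{−1}E)^i A^{−1} and compares against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = 2.0 * np.eye(d)                                 # lambda_min(A) = 2
B = rng.normal(size=(d, d))
E = 0.2 * (B + B.T) / 2.0                           # symmetric, small operator norm
Ainv = np.linalg.inv(A)
S, term = Ainv.copy(), Ainv.copy()
for i in range(60):
    term = -Ainv @ E @ term                         # adds (-1)^{i+1}(A^{-1}E)^{i+1}A^{-1}
    S = S + term
print(np.linalg.norm(S - np.linalg.inv(A + E)))     # ~ 1e-16: geometric convergence
```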

Exercise 5.5: Let f : R → R be a three-times differentiable convex function. We say f is


C-quasi-self-concordant (q.s.c.) if |f ′′′ (t)| ≤ C|f ′′ (t)| for all t ∈ R.

(a) Define g(t) = log f ′′ (t). Show that if f is C-q.s.c., then |g ′ (t)| ≤ C and so for any s ∈ R,

e−C|s| f ′′ (t) ≤ f ′′ (t + s) ≤ eC|s| f ′′ (t).

(b) Show that the function f (t) = log(1 + et ) + log(1 + e−t ) is q.s.c., and give its self-concordance
parameter.

(c) Show that the function f (t) = log(1 + et ) is q.s.c., and give its self-concordance parameter.

(d) Let f be C-q.s.c. and for a fixed x ∈ Rd , define h(θ) := f (⟨θ, x⟩). Show that for any θ, θ0 with
∆ = θ − θ0 , h satisfies

e−C|⟨∆,x⟩| ∇2 h(θ) ⪯ ∇2 h(θ0 ) ⪯ eC|⟨∆,x⟩| ∇2 h(θ).

Exercise 5.6 (Quasi self-concordant M-estimators [152]): Consider a prediction problem of


predicting targets y from vectors x ∈ Rd . A loss ℓ is a C-quasi self-concordant loss if we can write

ℓ(θ, (x, y)) = h(⟨θ, x⟩, y),

where for each y, h(·, y) is a C-q.s.c. function (recall Exercise 5.5).

(a) Show that logistic regression with loss ℓ(θ, (x, y)) = log(1+e−y⟨x,θ⟩ ) and robust linear regression
with loss ℓ(θ, (x, y)) = log(1 + ey−⟨x,θ⟩ ) + log(1 + e⟨x,θ⟩−y ) are both 1-q.s.c. losses.

For the remainder of the problem, assume that the data X ⊂ Rd satisfy ∥x∥2 ≤ √d for all x ∈ X.
Let L(θ) = E[ℓ(θ, (X, Y ))] and θ⋆ = argminθ L(θ). Assume that ∇2 L(θ⋆ ) ⪰ λI, where λ > 0 is
fixed, and let Ln (θ) = Pn ℓ(θ, (X, Y )) as usual.

(b) Show that if ∥θ − θ⋆∥2 ≤ 1/√d, then ∇²Ln(θ) ⪰ e^{−C} ∇²Ln(θ⋆).

(c) Argue that if t = ∥θ − θ⋆∥2 ≤ 1/√d and ∥∇Ln(θ⋆)∥2 ≤ γ, then

    Ln(θ) ≥ Ln(θ⋆) − γt + (e^{−C} λmin(∇²Ln(θ⋆))/2) t².

(d) Let ℓ be a 1-q.s.c. loss and assume that |h′(t, y)| ≤ 1 and |h″(t, y)| ≤ 1 for all t ∈ R. Give a
result similar to that of Theorem 5.3.8, but show that your conclusions hold with probability
at least 1 − δ as soon as

    n ≳ (d²/λ²) log(d/δ).


Exercise 5.7 (Truncation to obtain a moment bound): Let B < ∞. Show that under the
conditions of Corollary 5.3.10,

    E[ ∥θ̂n − θ⋆∥2² ∧ B ] ≤ tr(∇²L(θ⋆)^{−1} Cov(∇ℓ(θ⋆, Z)) ∇²L(θ⋆)^{−1}) / n + C log n / n^{3/2},

where C is a problem-dependent constant.
Exercise 5.8: Let θ̂n = argmin_θ Ln(θ) for an M-estimation problem satisfying the conditions of
Corollary 5.3.10. Show that for any k ≥ 1, there are events En with P(En) ≥ 1 − n^{−k} and for which

    | E[ ∥θ̂n − θ⋆∥2² 1{En} ] − (1/n) tr(∇²L(θ⋆)^{−1} Cov(∇ℓ(θ⋆, Z)) ∇²L(θ⋆)^{−1}) | ≤ Ck log n / n^{3/2},

where C is a problem-dependent constant.


Exercise 5.9: In this problem, you provide sufficient conditions for exponential family models to
have minimizers.

(a) Let Pθ be a minimal exponential family (Definition 3.2) with density pθ (x) = exp(θ⊤ x − A(θ))
with respect to a base measure µ. Show that for any θ⋆ ∈ dom A, the well-specified population
loss L(θ) = −Eθ⋆ [log pθ (X)] has unique minimizer θ⋆ .

For the remainder of the problem, we no longer assume the model is well-specified. For a measure
µ on a set X , recall that the essential supremum of a function f : X → R is

    ess sup_µ(f) := inf{ t ∈ R | µ({f(x) ≥ t}) = 0 }.

We say that a measure µ on Rd essentially covers a vector v ∈ Rd if

    v⊤u < ess sup_µ{x⊤u}  for all u ≠ 0.    (5.4.1)

(b) Let {Pθ } be the exponential family with density pθ (x) = exp(θ⊤ x − A(θ)) with respect to the
measure µ. Let X be a random variable for which µ essentially covers (5.4.1) the mean E[X].
Show that L(θ) = −E[log pθ (X)] has a minimizer.

Hint. A continuous convex function h has a minimizer if it is coercive, meaning that h(θ) → ∞
whenever ∥θ∥2 → ∞. Corollary C.3.7, part (i) may be useful.
Exercise 5.10: In this problem, you provide sufficient conditions for generalized linear models to
have minimizers. Let L(θ) = E[ℓ(θ, (X, Y ))] be the population loss (which may be misspecified).
(a) Consider Poisson regression with loss ℓ(θ, (x, y)) = −yx⊤θ + e^{θ⊤x}. Show that L has a unique
minimizer if E[X] = 0 and Cov(X) = E[XX⊤] ≻ 0.

(b) Consider logistic regression with loss ℓ(θ, (x, y)) = −yx⊤θ + log(1 + e^{θ⊤x}) for y ∈ {0, 1}. Show
that L has a unique minimizer if E[XX⊤] ≻ 0 and 0 < P(Y = 1 | X) < 1 with probability 1
over X.


(c) Consider a generalized linear model with densities pθ (y | x) = exp(ϕ(x, y)⊤ θ − A(θ | x)) w.r.t.
a base measure µ(· | x) on y ∈ Y, and assume for simplicity that µ(Y | x) = 1 for all x. Assume
that for each vector v ∈ Sd−1 and x ∈ X ,
    ess sup_{µ(·|x)} {v⊤ϕ(x, y)} ≥ EP[v⊤ϕ(x, Y) | X = x],

and the set of x for which a strict inequality holds has positive P -probability. (This is equivalent
to the set of x for which µ(· | x) essentially covers (5.4.1) the conditional mean EP [ϕ(x, Y ) |
X = x] having positive probability.) Show that a minimizer of L exists. You may assume
E[|A(θ | X)|] < ∞ for ∥θ∥2 ≤ 1 if it is convenient.
Hint. The techniques to solve Exercise 5.9 may be useful. In addition, see Exercise 4.2.
Exercise 5.11: Consider the robust regression setting of Example 5.3.3, and let h ≥ 0 be a
symmetric convex function, twice continuously differentiable in a neighborhood of 0. Assume that
for any (measurable) subset X0 ⊂ X , E[Y | X ∈ X0 ] exists and is finite, and assume E[XX ⊤ ] ≻
0. Show that a minimizer of L(θ) := E[h(⟨θ, X⟩ − Y )] exists. Hint. Show that L is coercive.
Corollary C.3.7, part (i) may be useful.
Exercise 5.12 (The delta method for approximate sums): Let T : Rd → Rp be a differentiable
function with derivative matrix Ṫ (θ) ∈ Rp×d , so that T (θ + ∆) = T (θ) + Ṫ (θ)∆ + o(∥∆∥) as ∆ → 0.
Let θbn ∈ Rd be a sequence of random vectors with
θbn − θ = Pn Z + Rn ,
where Zi are i.i.d. and Rn is a remainder term.

(a) Assume that E[∥Zi∥2²] < ∞ and that for each ϵ > 0, P(∥Rn∥2 ≥ ϵ/√n) → 0 as n → ∞. Show
that

    T(θ̂n) − T(θ) = Ṫ(θ) Pn Z + Rn′,

where the remainder Rn′ also satisfies P(∥Rn′∥2 ≥ ϵ/√n) → 0 for all ϵ > 0.
(b) Assume that T is locally smooth enough that for some K < ∞ and δ > 0,

    ∥T(θ + ∆) − T(θ) − Ṫ(θ)∆∥2 ≤ K ∥∆∥2²

when ∥∆∥2 ≤ δ. Assume additionally that there exist C0, C1 < ∞ such that for t ≥ 0, we have
∥Rn∥2 ≤ (C0/n) t with probability at least 1 − e^{−t} and that P(∥Pn Z∥2 ≥ t) ≤ C1 exp(−nt²/σ²).
Give a quantitative version of part (a).

JCD Comment: Add in some connections to the exponential family material. Some
ideas:

1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)

2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?

3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.

4. Maybe an exercise for Lipschitz functions with random Lipschitz constants?

Chapter 6

Generalization and stability

Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
X and some collection of functions f : X → R. We then wish to understand when we expect the
empirical distribution
    Pn := (1/n) Σ_{i=1}^n 1_{Xi},

defined by the sample Xi ∼iid P, to be close to the population P as measured by f. Following the
notation we introduce in Section 5.1, for P f := EP [f (X)], we again ask to have
    Pn f − P f = (1/n) Σ_{i=1}^n ( f(Xi) − EP[f(X)] )

to be small simultaneously for all f.


In this chapter, however, we develop a family of tools based around PAC (probably approximately
correct) Bayesian bounds, where we slightly perturb the functions f of interest to average them in
some way; when these perturbations keep Pn f stable, we expect that Pn f ≈ P f , that is, the sample
generalizes to the population. These perturbations allow us to bring the tools of the divergence
measures we have developed to bear on the problems of convergence and generalization. They also
allow us to go beyond the “basic” concentration inequalities to situations with interaction, where
a data analyst may evaluate some functions of Pn, then adaptively choose additional queries or
analyses to do on the same sample X1n. This breaks standard statistical analyses—which assume
an a priori specified set of hypotheses or questions to be answered—but is possible to address
once we can limit the information the analyses release in precise ways that information-theoretic
tools allow. Even more, in the next chapter we show how they form the basis for transportation
inequalities, powerful tools for concentration of measure. Modern work has also shown how to
leverage these techniques, coupled with computation, to provide non-vacuous bounds on learning
for complicated scenarios and models to which all classical bounds fail to apply, such as deep
learning.


6.1 The variational representation of Kullback-Leibler divergence


The starting point of all of our generalization bounds is a surprisingly simple variational result,
which relates expectations, moment generating functions, and the KL-divergence in one single
equality. It turns out that this identity, by relating means with moment generating functions
and divergences, allows us to prove generalization bounds based on information-theoretic tools and
stability.

Theorem 6.1.1 (Donsker-Varadhan variational representation). Let P and Q be distributions on
a common space X. Then

    Dkl(P||Q) = sup_g { EP[g(X)] − log EQ[e^{g(X)}] },

where the supremum is taken over measurable functions g : X → R with EQ [eg(X) ] < ∞. We can
also replace this by bounded simple functions g.

We give one proof of this result and one sketch of a proof, which holds when the underlying space
is discrete, that may be more intuitive: the first constructs a particular “tilting” of Q via the
function eg , and verifies the equality. The second relies on the discretization of the KL-divergence
and may be more intuitive to readers familiar with convex optimization: essentially, we expect this
result because the function v ↦ log(Σ_{j=1}^k e^{v_j}) is the convex conjugate of the negative entropy. (See also
Exercise 6.1.)
Proof We may assume that P is absolutely continuous with respect to Q, meaning that Q(A) = 0
implies that P (A) = 0, as otherwise both sides are infinite by inspection. Thus, it is no loss of
generality to let P and Q have densities p and q.
Attainment in the equality is easy: we simply take g(x) = log(p(x)/q(x)), so that EQ[e^{g(X)}] = 1 and
EP[g(X)] = Dkl(P||Q). To show that the right hand side is never larger than Dkl(P||Q) requires a bit
more work. To that end, let g be any function such that EQ[e^{g(X)}] < ∞, and define the random
variable Zg(x) = e^{g(x)}/EQ[e^{g(X)}], so that EQ[Zg] = 1. Then using the absolute continuity of P
w.r.t. Q, we have

    EP[log Zg] = EP[ log(p(X)/q(X)) + log( (q(X)/p(X)) Zg(X) ) ] = Dkl(P||Q) + EP[ log( (dQ/dP) Zg ) ]
               ≤ Dkl(P||Q) + log EP[ (dQ/dP) Zg ]
               = Dkl(P||Q) + log EQ[Zg].

As EQ [Zg ] = 1, using that EP [log Zg ] = EP [g(X)] − log EQ [eg(X) ] gives the result.
For the claim that bounded simple functions are sufficient, all we need to do is demonstrate
(asymptotic) achievability. For this, we use the definition (2.2.1) of the KL-divergence as a
supremum over partitions. Take An to be a sequence of partitions so that Dkl(P||Q | An) →
Dkl(P||Q). Then let gn(x) = Σ_{A∈An} 1{x ∈ A} log(P(A)/Q(A)), which gives Dkl(P||Q | An) = EP[gn(X)] −
log EQ[e^{gn(X)}].

Here is the second proof of Theorem 6.1.1, which applies when X is discrete and finite. That we
can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests
that this approach works in general—which it can—but this requires some not completely trivial


approximations of EP [g] and EQ [eg ] by discretized versions of their expectations, which makes
things rather tedious.
Proof of Theorem 6.1.1, the finite case We have assumed that P and Q have finite
supports, which we identify with {1, . . . , k}, with p.m.f.s p, q ∈ ∆k = {p ∈ R^k_+ | ⟨1, p⟩ = 1}. Define
fq(v) = log(Σ_{j=1}^k qj e^{vj}), which is convex in v (recall Proposition 3.2.1). Then the supremum in
the variational representation takes the form

    h(p) := sup_{v∈R^k} { ⟨p, v⟩ − fq(v) }.

If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that
end, note that

    ∇v { ⟨p, v⟩ − fq(v) } = p − [ qi e^{vi} / Σ_{j=1}^k qj e^{vj} ]_{i=1}^k ,

so that setting vj = log(pj/qj) achieves p − ∇v fq(v) = p − p = 0 and hence the supremum. Noting that
log(Σ_{j=1}^k qj exp(log(pj/qj))) = log(Σ_{j=1}^k pj) = 0 gives h(p) = Dkl(p||q).
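One can check the finite-alphabet identity numerically: maximizing ⟨p, v⟩ − fq(v) over v ∈ R^k recovers Dkl(p||q). The sketch below assumes SciPy is available and uses a generic quasi-Newton solver; the flat direction from adding constants to v is harmless:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
k = 5
p, q = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
neg_obj = lambda v: -(p @ v - np.log(q @ np.exp(v)))    # -(<p, v> - f_q(v))
sup_value = -minimize(neg_obj, np.zeros(k)).fun
print(sup_value, np.sum(p * np.log(p / q)))             # both approximate KL(p||q)
```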

The Donsker-Varadhan variational representation already gives a hint that we can use some
information-theoretic techniques to control the difference between an empirical sample and its
expectation, at least in an average sense. In particular, we see that for any function g, we have

EP [g(X)] ≤ Dkl (P ||Q) + log EQ [eg(X) ]

for any random variable X. Now, turning this on its head a bit, suppose that we consider a
collection of functions F and put two probability measures π and π0 on F, and consider Pn f − P f,
where we treat f as a random variable f ∼ π or f ∼ π0. Then a consequence of the Donsker-
Varadhan theorem is that

    ∫ (Pn f − P f) dπ(f) ≤ Dkl(π||π0) + log ∫ exp(Pn f − P f) dπ0(f)

for any π, π0 . While this inequality is a bit naive—bounding a difference by an exponent seems
wasteful—as we shall see, it has substantial applications when we can upper bound the KL-
divergence Dkl (π||π0 ).

6.2 PAC-Bayes bounds


Probably-approximately-correct (PAC) Bayesian bounds proceed from a perspective similar to that
of the covering numbers and covering entropies we develop in Section 5.1, where if for a collection
of functions F there is a finite subset (a cover) {fv } such that each f ∈ F is “near” one of the
fv , then we need only control deviations of Pn f from P f for the elements of {fv }. In PAC-Bayes
bounds, we instead average functions f with other functions, and this averaging allows a similar
family of guarantees and applications.
Let us proceed with the main results. Let F be a collection of functions f : X → R, and
assume that each function f is σ 2 -sub-Gaussian, which we recall (Definition 4.1) means that
E[e^{λ(f(X)−Pf)}] ≤ exp(λ²σ²/2) for all λ ∈ R, where P f = EP[f(X)] = ∫ f(x) dP(x) denotes the


expectation of f under P . The main theorem of this section shows that averages of the squared
error (Pn f − P f )2 of the empirical distribution Pn to P converge quickly to zero for all averaging
distributions π on functions f ∈ F so long as each f is σ 2 -sub-Gaussian, with the caveat that we
pay a cost for different choices of π. The key is that we choose some prior distribution π0 on F
first.

Theorem 6.2.1. Let Π be the collection of all probability distributions on the set F and let π0 be
a fixed prior probability distribution on f ∈ F. With probability at least 1 − δ,

    ∫ (Pn f − P f)² dπ(f) ≤ (8σ²/3) · ( Dkl(π||π0) + log(2/δ) ) / n    simultaneously for all π ∈ Π.
Proof The key is to combine Example 4.1.12 with the variational representation that Theo-
rem 6.1.1 provides for KL-divergences. We state Example 4.1.12 as a lemma here.

Lemma 6.2.2. Let Z be a σ²-sub-Gaussian random variable. Then for λ ≥ 0,

    E[e^{λZ²}] ≤ 1 / √( [1 − 2σ²λ]_+ ).

Without loss of generality, we assume that P f = 0 for all f ∈ F, and recall that Pn f =
(1/n) Σ_{i=1}^n f(Xi) is the empirical mean of f. Then we know that Pn f is σ²/n-sub-Gaussian, and
Lemma 6.2.2 implies that E[exp(λ(Pn f)²)] ≤ [1 − 2λσ²/n]_+^{−1/2} for any f, and thus for any prior
π0 on f we have

    E[ ∫ exp(λ(Pn f)²) dπ0(f) ] ≤ [1 − 2λσ²/n]_+^{−1/2}.

Consequently, taking λ = λn := 3n/(8σ²), we obtain

    E[ ∫ exp(λn (Pn f)²) dπ0(f) ] = E[ ∫ exp( (3n/(8σ²)) (Pn f)² ) dπ0(f) ] ≤ 2.
8σ 2

Markov's inequality thus implies that

    P( ∫ exp(λn (Pn f)²) dπ0(f) ≥ 2/δ ) ≤ δ,    (6.2.1)

where the probability is over Xi ∼iid P.
Now, we use the Donsker-Varadhan equality (Theorem 6.1.1). Letting λ > 0, we define the
function g(f ) = λ(Pn f )2 , so that for any two distributions π and π0 on F, we have

    ∫ (Pn f)² dπ(f) = (1/λ) ∫ g(f) dπ(f) ≤ ( Dkl(π||π0) + log ∫ exp(λ(Pn f)²) dπ0(f) ) / λ.

This holds without any probabilistic qualifications, so using the application (6.2.1) of Markov’s
inequality with λ = λn , we thus see that with probability at least 1 − δ over X1 , . . . , Xn , simulta-
neously for all distributions π,

    ∫ (Pn f)² dπ(f) ≤ (8σ²/3) · ( Dkl(π||π0) + log(2/δ) ) / n.


This is the desired result (as we have assumed that P f = 0 w.l.o.g.).

By Jensen’s inequality (or Cauchy-Schwarz), it is immediate from Theorem 6.2.1 that we also
have
    ∫ |Pn f − P f| dπ(f) ≤ √( (8σ²/3) ( Dkl(π||π0) + log(2/δ) ) / n )    simultaneously for all π ∈ Π    (6.2.2)

with probability at least 1 − δ, so that Eπ[|Pn f − P f|] is with high probability of order 1/√n. The
inequality (6.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly
sharper constants and improved logarithmic dependence. The key is that stability, in the form of
closeness of a prior π0 and posterior π, allows us to achieve reasonably tight control over the deviations
of random variables and functions with high probability.
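To see the theorem concretely, the following Monte Carlo sketch takes F = {x ↦ 1{x ≤ t}}, t ∈ R, under P = N(0, 1), so that each f is [0, 1]-valued and hence sub-Gaussian with σ² = 1/4, with prior π0 = N(0, 1) and posterior π = N(µ, τ²) over the threshold t. The particular parameter values are our own illustrative choices (and the sketch assumes SciPy for the Gaussian c.d.f.):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, delta, sigma2 = 500, 0.05, 0.25            # f in [0, 1] gives sigma^2 = 1/4
X = rng.normal(size=n)                        # the sample, P = N(0, 1)
mu, tau = 0.7, 0.2                            # posterior pi = N(mu, tau^2)
ts = rng.normal(mu, tau, size=2000)           # Monte Carlo draws f ~ pi
emp = (X[None, :] <= ts[:, None]).mean(axis=1)          # P_n f per threshold
lhs = np.mean((emp - norm.cdf(ts)) ** 2)                # int (P_n f - P f)^2 dpi
kl = np.log(1.0 / tau) + (tau**2 + mu**2) / 2 - 0.5     # KL(pi || pi_0)
rhs = (8 * sigma2 / 3) * (kl + np.log(2 / delta)) / n
print(lhs, rhs, lhs <= rhs)                   # bound holds w.p. at least 1 - delta
```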
Let us give an example, similar to many of our approaches in Section 5.2, to illustrate what these
bounds allow. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.

Example 6.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 5.2,
where we let L(θ) = P ℓ(θ, Z) for some function ℓ : Θ × Z → R. Let Bd2 = {v ∈ Rd | ∥v∥2 ≤ 1}
be the ℓ2 -ball in Rd , and let us assume that Θ ⊂ rBd2 and additionally that θ 7→ ℓ(θ, z) is
M -Lipschitz for all z ∈ Z. For simplicity, we assume that ℓ(θ, z) ∈ [0, 2M r] for all θ ∈ Θ (we
may simply relativize our bounds by replacing ℓ by ℓ(·, z) − inf θ∈Θ ℓ(θ, z) ∈ [0, 2M r]).
If L
b n (θ) = Pn ℓ(θ, Z), then Theorem 6.2.1 implies that
s
2 2
Z  
b n (θ) − L(θ)|dπ(θ) ≤ 8M r Dkl (π||π0 ) + log 2
|L
3n δ

for all π with probability at least 1 − δ. Now, let θ0 ∈ Θ be arbitrary, and for ϵ > 0 (to be
chosen later) take π0 to be uniform on (r + ϵ)Bd2 and π to be uniform on θ0 + ϵBd2. Then we
immediately see that Dkl(π||π0) = d log(1 + r/ϵ). Moreover, we have ∫ L̂n(θ) dπ(θ) ∈ L̂n(θ0) ± Mϵ
and similarly for L(θ), by the M-Lipschitz continuity of ℓ. For any fixed ϵ > 0, we thus have

    |L̂n(θ0) − L(θ0)| ≤ 2Mϵ + √( (8M²r²/3n) ( d log(1 + r/ϵ) + log(2/δ) ) )

simultaneously for all θ0 ∈ Θ, with probability at least 1 − δ. By choosing ϵ = rd/n we obtain
that with probability at least 1 − δ,

    sup_{θ∈Θ} |L̂n(θ) − L(θ)| ≤ 2Mrd/n + √( (8M²r²/3n) ( d log(1 + n/d) + log(2/δ) ) ).

Thus, roughly, with high probability we have |L̂n(θ) − L(θ)| ≤ O(1) M r √((d/n) log(n/d)) for all θ. 3

On the one hand, the result in Example 6.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for


specially structured linear function classes, then applying Rademacher complexity bounds—such
as Proposition 5.2.9 and Example 5.2.10—we have somewhat weaker results, in that they depend
on the dimension explicitly, while the Rademacher bounds do not exhibit this explicit dependence.
This means they can potentially apply in infinite dimensional spaces that Example 6.2.3 cannot.
We will give an example presently showing how to address some of these issues.

6.2.1 Relative bounds


In many cases, it is useful to have bounds that provide somewhat finer control than the bounds
we have presented. Recall from our discussion of sub-Gaussian and sub-exponential random vari-
ables, especially the Bennett and Bernstein-type inequalities (Proposition 4.1.21), that if a random
variable X satisfies |X| ≤ b but Var(X) ≤ σ 2 ≪ b2 , then X concentrates more quickly about
its mean than the convergence provided by naive application of sub-Gaussian concentration with
sub-Gaussian parameter b2 /8. To that end, we investigate an alternative to Theorem 6.2.1 that
allows somewhat sharper control.
The approach is similar to our derivation in Theorem 6.2.1, where we show that the moment
generating function of a quantity like Pn f − P f is small (Eq. (6.2.1)) and then relate this—via the
Donsker-Varadhan change of measure in Theorem 6.1.1—to the quantities we wish to control. In
the next proposition, we provide relative bounds on the deviations of functions from their means.
To make this precise, let F be a collection of functions f : X → R, and let σ 2 (f ) := Var(f (X)) be
the variance of functions in F. We assume the class satisfies the Bernstein condition (4.1.7) with
parameter b, that is,
    |E[(f(X) − P f)^k]| ≤ (k!/2) σ²(f) b^{k−2}    for k = 3, 4, . . . .    (6.2.3)

This says that the second moment of functions f ∈ F bounds—with the additional boundedness-
type constant b—the higher moments of functions in F. We then have the following result.
Proposition 6.2.4. Let F be a collection of functions f : X → R satisfying the Bernstein condition (6.2.3). Then for any |λ| ≤ 1/(2b), with probability at least 1 − δ,

    λ ∫ P f dπ(f) − λ² ∫ σ²(f) dπ(f) ≤ λ ∫ Pn f dπ(f) + (1/n) ( Dkl(π||π0) + log(1/δ) )

simultaneously for all π ∈ Π.
Proof We begin with an inequality on the moment generating function of random variables
satisfying the Bernstein condition (4.1.7), that is, that |E[(X − µ)^k]| ≤ (k!/2) σ² b^{k−2} for k ≥ 2. In this
case, Lemma 4.1.20 implies that

    E[e^{λ(X−µ)}] ≤ exp(λ²σ²)

for |λ| ≤ 1/(2b). As a consequence, for any f in our collection F, we see that if we define

    ∆n(f, λ) := λ( Pn f − P f − λσ²(f) ),

we have that

    E[exp(n∆n(f, λ))] = E[exp(λ(f(X) − P f) − λ²σ²(f))]^n ≤ 1

for all n, f ∈ F, and |λ| ≤ 1/(2b). Then, for any fixed measure π0 on F, Markov's inequality implies
that

    P( ∫ exp(n∆n(f, λ)) dπ0(f) ≥ 1/δ ) ≤ δ.    (6.2.4)


Now, as in the proof of Theorem 6.2.1, we use the Donsker-Varadhan Theorem 6.1.1 (change of
measure), which implies that
    n ∫ ∆n(f, λ) dπ(f) ≤ Dkl(π||π0) + log ∫ exp(n∆n(f, λ)) dπ0(f)

for all distributions π. Using inequality (6.2.4), we obtain that with probability at least 1 − δ,

    ∫ ∆n(f, λ) dπ(f) ≤ (1/n) ( Dkl(π||π0) + log(1/δ) )

for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.

We would like to optimize over the bound in Proposition 6.2.4 by choosing the “best” λ. If we
could choose the optimal λ, by rearranging Proposition 6.2.4 we would obtain the bound
 
    Eπ[P f] ≤ Eπ[Pn f] + inf_{λ>0} { λ Eπ[σ²(f)] + (1/(nλ)) ( Dkl(π||π0) + log(1/δ) ) }
            = Eπ[Pn f] + 2 √( Eπ[σ²(f)] ( Dkl(π||π0) + log(1/δ) ) / n )
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 6.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
To that end, let us consider the following condition, a type of relative error condition related
to the Bernstein condition (4.1.7): for each f ∈ F,

σ 2 (f ) ≤ bP f. (6.2.5)

This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = ℓ(θ, X) for some loss function ℓ and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (6.2.5) holds. Revisiting Proposition 6.2.4, we rearrange to obtain the following
theorem.

Theorem 6.2.5. Let F be a collection of functions satisfying the Bernstein condition (6.2.3) as in
Proposition 6.2.4, and in addition, assume the variance-bounding condition (6.2.5). Then for any
0 ≤ λ ≤ 1/(2b), with probability at least 1 − δ,

    Eπ[P f] ≤ Eπ[Pn f] + (λb/(1 − λb)) Eπ[Pn f] + (1/(λ(1 − λb))) · (1/n) ( Dkl(π||π0) + log(1/δ) )

for all π.

Proof We use condition (6.2.5) to see that

λEπ [P f ] − λ2 bEπ [P f ] ≤ λEπ [P f ] − λ2 Eπ [σ 2 (f )],


apply Proposition 6.2.4, and divide both sides of the resulting inequality by λ(1 − λb).

To make this uniform in λ, thus achieving a tighter bound (so that we need not pre-select λ),
we choose multiple values of λ and apply a union bound. To that end, let 1 + η = 1/(1 − λb), or η = λb/(1 − λb)
and 1/(λb(1 − λb)) = (1 + η)²/η, so that the inequality in Theorem 6.2.5 is equivalent to

    Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + ((1 + η)²/η) · (b/n) ( Dkl(π||π0) + log(1/δ) ).

Using that our choice of η ∈ [0, 1] implies (1 + η)² ≤ 1 + 3η, this gives

    Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + (b/(ηn)) ( Dkl(π||π0) + log(1/δ) ) + (3b/n) ( Dkl(π||π0) + log(1/δ) ).
Now, take η1 = 1/n, . . . , ηn = 1. Then by optimizing over η ∈ {η1 , . . . , ηn } (which is equivalent, to
within a 1/n factor, to optimizing over 0 < η ≤ 1) and applying a union bound, we obtain
Corollary 6.2.6. Let the conditions of Theorem 6.2.5 hold. Then with probability at least 1 − δ,

    Eπ[P f] ≤ Eπ[Pn f] + 2 √( (b Eπ[Pn f]/n) ( Dkl(π||π0) + log(n/δ) ) ) + (1/n) ( Eπ[Pn f] + 5b ( Dkl(π||π0) + log(n/δ) ) ),

simultaneously for all π on F.
Proof By a union bound, we have

    Eπ[P f] ≤ Eπ[Pn f] + η Eπ[Pn f] + (b/(ηn)) ( Dkl(π||π0) + log(n/δ) ) + (3b/n) ( Dkl(π||π0) + log(n/δ) )

for each η ∈ {1/n, . . . , 1}. We consider two cases. In the first, assume that Eπ[Pn f] ≤ (b/n)(Dkl(π||π0) +
log(n/δ)). Then taking η = 1 above evidently gives the result. In the second, we have Eπ[Pn f] >
(b/n)(Dkl(π||π0) + log(n/δ)), and we can set

    η⋆ = √( (b/n)(Dkl(π||π0) + log(n/δ)) / Eπ[Pn f] ) ∈ (0, 1).

Choosing η to be the smallest value ηk in {η1, . . . , ηn} with ηk ≥ η⋆, so that η⋆ ≤ η ≤ η⋆ + 1/n, then
implies the claim in the corollary.

6.2.2 A large-margin guarantee


Let us revisit the loss minimization approaches central to Section 5.2 and Example 6.2.3 in the
context of Corollary 6.2.6. We will investigate an approach to achieve convergence guarantees that
are (nearly) independent of dimension, focusing on 0-1 losses in a binary classification problem.
Consider a binary classification problem with data (x, y) ∈ Rd × {±1}, where we make predictions
⟨θ, x⟩ (or its sign), and for a margin penalty γ ≥ 0 we define the loss

ℓγ (θ; (x, y)) = 1 {⟨θ, x⟩y ≤ γ} .


We call the quantity ⟨θ, x⟩y the margin of θ on the pair (x, y), noting that when the margin is
large, ⟨θ, x⟩ has the same sign as y and is “confident” (i.e. far from zero). For shorthand, let us
define the expected and empirical losses at margin γ by

    Lγ(θ) := P ℓγ(θ; (X, Y))  and  L̂γ(θ) := Pn ℓγ(θ; (X, Y)).

Consider the following scenario: the data x lie in a ball of radius b, so that ∥x∥2 ≤ b; note that
the losses ℓγ and ℓ0 satisfy the Bernstein (6.2.3) and self-bounding (6.2.5) conditions with constant
1 as they take values in {0, 1}. We then have the following proposition.

Proposition 6.2.7. Let the above conditions on the data (x, y) hold and let the margin γ > 0 and
radius r < ∞. Then with probability at least 1 − δ,

    P(⟨θ, X⟩Y ≤ 0) ≤ (1 + 1/n) Pn(⟨θ, X⟩Y ≤ γ) + 8 (rb √(log(n/δ)) / (γ√n)) √( Pn(⟨θ, X⟩Y ≤ γ) ) + C (r²b² log(n/δ)) / (γ²n)

simultaneously for all ∥θ∥2 ≤ r, where C is a numerical constant independent of the problem
parameters.

Proposition 6.2.7 provides a “dimension-free” guarantee—it depends only on the ℓ2 -norms ∥θ∥2
and ∥x∥2 —so that it can apply equally in infinite dimensional spaces. The key to the inequality
is that if we can find a large margin predictor—for example, one achieved by a support vector
machine or, more broadly, by minimizing a convex loss of the form
\[
\mathop{\mathrm{minimize}}_{\|\theta\|_2 \le r} \;\; \frac{1}{n}\sum_{i=1}^n \phi(\langle X_i, \theta\rangle Y_i)
\]

for some decreasing convex ϕ : R → R+ , e.g. ϕ(t) = [1 − t]+ or ϕ(t) = log(1 + e−t )—then we get
strong generalization performance guarantees relative to the empirical margin γ. As one particular
instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a
vector θ with ∥θ∥2 ≤ r such that ⟨θ, Xi ⟩Yi ≥ γ for each i = 1, . . . , n. Then Proposition 6.2.7
guarantees that
\[
P(\langle\theta, X\rangle Y \le 0) \le C\,\frac{r^2 b^2 \log\frac{n}{\delta}}{\gamma^2 n}
\]
with probability at least 1 − δ.
Proof  Let π_0 be N(0, τ²I) for some τ > 0 to be chosen, and let π be N(θ̂, τ²I) for some θ̂ ∈ R^d satisfying ∥θ̂∥_2 ≤ r. Then Corollary 6.2.6 implies that
\[
\begin{aligned}
E_\pi[L_\gamma(\theta)]
&\le E_\pi[\widehat{L}_\gamma(\theta)] + 2\sqrt{\frac{E_\pi[\widehat{L}_\gamma(\theta)]}{n}\left[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\right]} + \frac{1}{n}\left(E_\pi[\widehat{L}_\gamma(\theta)] + C\left[D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{n}{\delta}\right]\right) \\
&\le E_\pi[\widehat{L}_\gamma(\theta)] + 2\sqrt{\frac{E_\pi[\widehat{L}_\gamma(\theta)]}{n}\left[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\right]} + \frac{1}{n}\left(E_\pi[\widehat{L}_\gamma(\theta)] + C\left[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\right]\right)
\end{aligned}
\]
simultaneously for all θ̂ satisfying ∥θ̂∥_2 ≤ r with probability at least 1 − δ, where we have used that D_kl(N(θ̂, τ²I)‖N(0, τ²I)) = ∥θ̂∥_2²/(2τ²) ≤ r²/(2τ²).


Let us use the margin assumption. Note that if Z ∼ N(0, τ²I), then for any fixed θ_0, x, y we have
\[
\ell_0(\theta_0; (x, y)) - P(Z^\top x \ge \gamma) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + P(Z^\top x \ge \gamma),
\]
where the middle expectation is over Z ∼ N(0, τ²I). Using the τ²∥x∥_2²-sub-Gaussianity of Z^⊤x, we can obtain immediately that if ∥x∥_2 ≤ b, we have
\[
\ell_0(\theta_0; (x, y)) - \exp\left(-\frac{\gamma^2}{2\tau^2 b^2}\right) \le E[\ell_\gamma(\theta_0 + Z; (x, y))] \le \ell_{2\gamma}(\theta_0; (x, y)) + \exp\left(-\frac{\gamma^2}{2\tau^2 b^2}\right).
\]
Returning to our earlier bound, we evidently have that if ∥x∥_2 ≤ b for all x ∈ X, then with probability at least 1 − δ, simultaneously for all θ ∈ R^d with ∥θ∥_2 ≤ r,
\[
\begin{aligned}
L_0(\theta) &\le \widehat{L}_{2\gamma}(\theta) + 2\exp\left(-\frac{\gamma^2}{2\tau^2 b^2}\right) + 2\sqrt{\frac{\widehat{L}_{2\gamma}(\theta) + \exp(-\frac{\gamma^2}{2\tau^2 b^2})}{n}\left[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\right]} \\
&\qquad + \frac{1}{n}\left(\widehat{L}_{2\gamma}(\theta) + \exp\left(-\frac{\gamma^2}{2\tau^2 b^2}\right)\right) + \frac{C}{n}\left[\frac{r^2}{2\tau^2} + \log\frac{n}{\delta}\right].
\end{aligned}
\]
Setting τ² = γ²/(2b² log n), so that exp(−γ²/(2τ²b²)) = 1/n, we immediately see that for any choice of margin γ > 0, we have with probability at least 1 − δ that
\[
\begin{aligned}
L_0(\theta) &\le \widehat{L}_{2\gamma}(\theta) + \frac{2}{n} + 2\sqrt{\frac{1}{n}\left[\widehat{L}_{2\gamma}(\theta) + \frac{1}{n}\right]\left[\frac{r^2 b^2 \log n}{\gamma^2} + \log\frac{n}{\delta}\right]} \\
&\qquad + \frac{1}{n}\left[\widehat{L}_{2\gamma}(\theta) + \frac{1}{n}\right] + \frac{C}{n}\left[\frac{r^2 b^2 \log n}{\gamma^2} + \log\frac{n}{\delta}\right]
\end{aligned}
\]
for all ∥θ∥_2 ≤ r.
Rewriting (replacing 2γ with γ) and recognizing that with no loss of generality we may take γ
such that rb ≥ γ gives the claim of the proposition.
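
As a concrete instance of the convex-surrogate recipe above, here is a minimal sketch in Python with NumPy. The data, the radius r, and the stepsize are illustrative choices, not prescribed by the text; the code minimizes the norm-constrained empirical hinge loss by projected subgradient descent and then reports the empirical margin loss P_n(⟨θ, X⟩Y ≤ γ) appearing in Proposition 6.2.7.

    import numpy as np

    def train_margin_classifier(X, y, r=5.0, steps=2000, lr=0.05):
        """Projected subgradient descent for the norm-constrained hinge loss
        minimize_{||theta||_2 <= r} (1/n) sum_i [1 - <x_i, theta> y_i]_+ ."""
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(steps):
            margins = (X @ theta) * y
            active = margins < 1.0                  # points where the hinge is active
            grad = -(y[active, None] * X[active]).sum(axis=0) / n
            theta = theta - lr * grad
            norm = np.linalg.norm(theta)
            if norm > r:                            # project back onto the l2-ball
                theta = theta * (r / norm)
        return theta

    def empirical_margin_loss(theta, X, y, gamma):
        """P_n(<theta, X> Y <= gamma), the empirical quantity in Proposition 6.2.7."""
        return np.mean((X @ theta) * y <= gamma)

    # Illustrative usage on a linearly separable synthetic problem
    rng = np.random.default_rng(0)
    w = np.array([1.0, -1.0, 2.0]) / np.sqrt(6.0)
    X = rng.normal(size=(500, 3))
    y = np.sign(X @ w)
    theta = train_margin_classifier(X, y)
    print(empirical_margin_loss(theta, X, y, gamma=0.1))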

6.2.3 A mutual information bound


An alternative perspective on the PAC-Bayesian bounds that Theorem 6.2.1 gives is to develop bounds based on mutual information, which is also central to the interactive data analysis setting in the next section. We present a few results along these lines here. Assume the setting of Theorem 6.2.1, so that F consists of σ²-sub-Gaussian functions. Let us assume the following observational model: we observe X_1^n drawn i.i.d. from P, and then, conditional on the sample X_1^n, draw a (random) function F ∈ F following the distribution π(· | X_1^n). Assuming the prior π_0 is fixed, Theorem 6.2.1 guarantees that with probability at least 1 − δ over X_1^n,
\[
E[(P_n F - P F)^2 \mid X_1^n] \le \frac{8\sigma^2}{3n}\left(D_{\mathrm{kl}}(\pi(\cdot \mid X_1^n)\|\pi_0) + \log\frac{2}{\delta}\right),
\]
where the expectation is taken over F ∼ π(· | X_1^n), leaving the sample fixed. Now, consider choosing π_0 to be the average over all samples X_1^n of π, that is, π_0(·) = E_P[π(· | X_1^n)], the expectation taken over X_1^n drawn i.i.d. from P. Then by definition of mutual information,
\[
I(F; X_1^n) = E_P\left[D_{\mathrm{kl}}(\pi(\cdot \mid X_1^n)\|\pi_0)\right],
\]


and by Markov’s inequality we have
\[
P\left(D_{\mathrm{kl}}(\pi(\cdot \mid X_1^n)\|\pi_0) \ge K \cdot I(F; X_1^n)\right) \le \frac{1}{K}
\]
for all K > 0. Combining these, we obtain the following corollary.

Corollary 6.2.8. Let F be chosen according to any distribution π(· | X_1^n) conditional on the sample X_1^n. Then with probability at least 1 − δ_0 − δ_1 over the sample X_1^n drawn i.i.d. from P,
\[
E[(P_n F - P F)^2 \mid X_1^n] \le \frac{8\sigma^2}{3n}\left(\frac{I(F; X_1^n)}{\delta_0} + \log\frac{2}{\delta_1}\right).
\]

This corollary shows that if we have any procedure—say, a learning procedure or otherwise—that limits the information between a sample X_1^n and an output F, then we are guaranteed that F generalizes. Tighter analyses are possible, though they are not our focus here; already there should be an inkling that limiting the information between input samples and outputs may be fruitful.

6.3 Interactive data analysis


A major challenge in modern data analysis is that analyses often depart from the classical statistics and scientific-method setting. In the scientific method—forgive me for being a pedant—one proposes a hypothesis, the status quo or some other belief, and then designs an experiment to falsify that hypothesis. Upon performing the experiment, there are only two options: either the experimental results contradict the hypothesis (that is, we must reject the null) so that the hypothesis is false, or the hypothesis remains consistent with available data. In the classical (Fisherian) statistics perspective, this typically means that we have a single null hypothesis H_0 before observing a sample, we draw a sample X ∈ X, and then for some test statistic T : X → R with observed value t_observed = T(X), we compute the probability under the null of observing something as extreme as what we observed, that is, the p-value p = P_{H_0}(T(X) ≥ t_observed).
Yet modern data analyses are distant from this pristine perspective for many reasons. The simplest is that we often have a number of hypotheses we wish to test, not a single one. For example, in biological applications, we may wish to investigate the associations between the expression of a number of genes and a particular phenotype or disease; each gene j then corresponds to a null hypothesis H_{0,j} that gene j is independent of the phenotype. There are numerous approaches to addressing the challenges associated with such multiple testing problems—such as false discovery rate control, familywise error rate control, and others—with whole courses devoted to the challenges.
Even these approaches to multiple testing and high-dimensional problems do not truly capture
modern data analyses, however. Indeed, in many fields, researchers use one or a few main datasets,
writing papers and performing multiple analyses on the same dataset. For example, in medicine,
the UK Biobank dataset [174] has several thousand citations (as of 2023), many of which build
on one another, with early studies coloring the analyses in subsequent studies. Even in situations
without a shared dataset, analyses present researchers with huge degrees of freedom and choice.
A researcher may study a summary statistic of his or her sampled data, or a plot of a few simple
relationships, performing some simple data exploration—which statisticians and scientists have
advocated for 50 years, dating back at least to John Tukey!—but this means that there are huge
numbers of potential comparisons a researcher might make (that he or she does not). This “garden


of forking paths,” as Gelman and Loken [100] term it, causes challenges even when researchers are not “p-hacking” or going on a “fishing expedition” to try to find publishable results. The problem in these studies and approaches is that, because we make decisions that may, if only in a small way, depend on the data observed, we have invalidated all classical statistical analyses.
To that end, we now consider interactive data analyses, where we perform data analyses se-
quentially, computing new functions on a fixed sample X1 , . . . , Xn after observing some initial
information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.

6.3.1 The interactive setting


We do not consider the interactive data analysis setting in full; rather, we consider a stylized approach to the problem, as it captures many of the challenges while being broad enough for different applications. In particular, we focus on the statistical queries setting, where a data analyst wishes to evaluate expectations
\[
E_P[\phi(X)] \tag{6.3.1}
\]
of various functionals ϕ : X → R under the population P using a sample X_1^n drawn i.i.d. from P. Certainly, numerous problems are solvable using statistical queries (6.3.1). Means use ϕ(x) = x, while we can compute variances using the two statistical queries ϕ_1(x) = x and ϕ_2(x) = x², as Var(X) = E_P[ϕ_2(X)] − E_P[ϕ_1(X)]².
Classical algorithms for the statistical query problem simply return sample means P_n ϕ := (1/n) Σ_{i=1}^n ϕ(X_i) given a query ϕ : X → R. When the number of queries to be answered is not chosen adaptively, this means we can typically answer a large number relatively accurately; indeed, if we have a finite collection Φ of σ²-sub-Gaussian ϕ : X → R, then we of course have
\[
P\left(\max_{\phi\in\Phi} |P_n \phi - P \phi| \ge \sqrt{\frac{2\sigma^2}{n}\big(\log(2|\Phi|) + t\big)}\right) \le e^{-t} \quad \text{for } t \ge 0
\]

by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.

Example 6.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-minimization setting (5.2.2), where the losses ℓ(θ, X_i) are convex and differentiable in θ. Then gradient descent applied to L̂_n(θ) = P_n ℓ(θ, X) will converge to a minimizing value of L̂_n. We can evidently implement gradient descent by a sequence of statistical queries ϕ(x) = ∇_θ ℓ(θ, x), iterating
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k P_n \phi^{(k)}, \tag{6.3.2}
\]
where ϕ^{(k)}(x) = ∇_θ ℓ(θ^{(k)}, x) and α_k is a stepsize. ♢

One issue with Example 6.3.1 is that we are interacting with the dataset, because each sequential query ϕ^{(k)} depends on the previous k − 1 queries. (Our results on uniform convergence of empirical functionals and related ideas address many of these challenges, so that the result of the process (6.3.2) will be well-behaved regardless of the interactivity.)
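
To make the iteration (6.3.2) concrete, here is a minimal sketch in Python with NumPy for least-squares losses ℓ(θ; (x, y)) = ½(⟨x, θ⟩ − y)², where every data access goes through a sample-mean statistical query; the synthetic data and the stepsize are illustrative assumptions.

    import numpy as np

    def answer_query(phi, data):
        """The (non-private) mechanism: return the sample mean P_n phi."""
        return np.mean([phi(z) for z in data], axis=0)

    def sq_gradient_descent(data, d, steps=100, alpha=0.5):
        """Gradient descent on the least-squares loss, implemented so that the
        only data access is the statistical query z -> grad_theta loss(theta, z)."""
        theta = np.zeros(d)
        for _ in range(steps):
            phi = lambda z, th=theta: (z[0] @ th - z[1]) * z[0]  # gradient query
            theta = theta - alpha * answer_query(phi, data)      # iteration (6.3.2)
        return theta

    # Illustrative usage with pairs z = (x, y)
    rng = np.random.default_rng(0)
    Xmat = rng.normal(size=(200, 3))
    yvec = Xmat @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
    print(sq_gradient_descent(list(zip(Xmat, yvec)), d=3))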


We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries ϕ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals ϕ by t ∈ T for a
(possibly infinite) set T , so we have a collection {ϕt }t∈T . In this context, we thus have the following
scheme:
Input: Sample X1n drawn i.i.d. P , collection {ϕt }t∈T of possible queries
Repeat: for k = 1, 2, . . .

i. Analyst chooses index Tk ∈ T and query ϕ := ϕTk

ii. Mechanism responds with answer Ak approximating P ϕ = EP [ϕ(X)] using X1n

Figure 6.1: The interactive statistical query setting

Of interest in the scheme of Figure 6.1 is that we interactively choose T_1, T_2, . . . , T_k, where the choice T_i may depend on our approximations of E_P[ϕ_{T_j}(X)] for j < i, that is, on the results of our previous queries. Even more broadly, the analyst may be able to choose the index T_k in alternative ways depending on the sample X_1^n, and our goal is still to be able to accurately compute expectations Pϕ_T = E_P[ϕ_T(X)] when the index T may depend on X_1^n. The setting in Figure 6.1 clearly breaks with the classical statistical setting, in which an analysis is pre-specified before collecting data, but more closely captures modern data exploration practices.

6.3.2 Second moment errors and mutual information


The starting point of our derivation is the following result, which follows from more or less identical arguments to those for our PAC-Bayesian bounds earlier.

Theorem 6.3.2. Let {ϕ_t}_{t∈T} be a collection of σ²-sub-Gaussian functions ϕ_t : X → R. Then for any random variable T and any λ > 0,
\[
E[(P_n \phi_T - P \phi_T)^2] \le \frac{1}{\lambda}\left(I(X_1^n; T) - \frac{1}{2}\log\big[1 - 2\lambda\sigma^2/n\big]_+\right)
\]
and
\[
|E[P_n \phi_T] - E[P \phi_T]| \le \sqrt{\frac{2\sigma^2}{n}\, I(X_1^n; T)},
\]
where the expectations are taken over T and the sample X_1^n.
Proof  The proof is similar to that of our first basic PAC-Bayes result in Theorem 6.2.1. Let us assume w.l.o.g. that Pϕ_t = 0 for all t ∈ T, noting that then P_n ϕ_t is σ²/n-sub-Gaussian. We prove the first result first. Lemma 6.2.2 implies that E[exp(λ(P_n ϕ_t)²)] ≤ [1 − 2λσ²/n]_+^{−1/2} for each t ∈ T. As a consequence, we obtain via the Donsker-Varadhan equality (Theorem 6.1.1) that
\[
\begin{aligned}
\lambda E\left[\int (P_n\phi_t)^2 d\pi(t)\right]
&\stackrel{(i)}{\le} E[D_{\mathrm{kl}}(\pi\|\pi_0)] + E\left[\log \int \exp(\lambda (P_n\phi_t)^2)\, d\pi_0(t)\right] \\
&\stackrel{(ii)}{\le} E[D_{\mathrm{kl}}(\pi\|\pi_0)] + \log \int E[\exp(\lambda (P_n\phi_t)^2)]\, d\pi_0(t) \\
&\stackrel{(iii)}{\le} E[D_{\mathrm{kl}}(\pi\|\pi_0)] - \frac{1}{2}\log\big[1 - 2\lambda\sigma^2/n\big]_+
\end{aligned}
\]


for all distributions π on T, which may depend on P_n, where the expectation E is taken over the sample X_1^n drawn i.i.d. from P. (Here inequality (i) is Theorem 6.1.1, inequality (ii) is Jensen’s inequality, and inequality (iii) is Lemma 6.2.2.) Now, let π_0 be the marginal distribution on T (marginally over all observations X_1^n), and let π denote the posterior of T conditional on the sample X_1^n. Then E[D_kl(π‖π_0)] = I(X_1^n; T) by definition of the mutual information, giving the bound on the squared error.
For the second result, note that the Donsker-Varadhan equality implies
\[
\lambda E\left[\int P_n\phi_t \, d\pi(t)\right] \le E[D_{\mathrm{kl}}(\pi\|\pi_0)] + \log \int E[\exp(\lambda P_n\phi_t)]\, d\pi_0(t) \le I(X_1^n; T) + \frac{\lambda^2\sigma^2}{2n}.
\]
Dividing both sides by λ and minimizing the result over λ > 0 gives E[P_n ϕ_T] ≤ √(2σ² I(X_1^n; T)/n), and performing the same analysis with −ϕ_T gives the second result of the theorem.

The key in the theorem is that if the mutual information—the Shannon information—I(X_1^n; T) between the sample X_1^n and T is small, then the expected squared error can be small. To make this a bit clearer, let us choose values for λ in the theorem; taking λ = n/(2eσ²) gives the following corollary.

Corollary 6.3.3. Let the conditions of Theorem 6.3.2 hold. Then
\[
E[(P_n\phi_T - P\phi_T)^2] \le \frac{2e\sigma^2}{n} I(X_1^n; T) + \frac{5\sigma^2}{4n}.
\]
Consequently, if we can limit the amount of information any particular query T (i.e., ϕ_T) contains about the actual sample X_1^n, then we guarantee reasonably high accuracy in the second moment errors (P_n ϕ_T − P ϕ_T)².

6.3.3 Limiting interaction in interactive analyses


Let us now return to the interactive data analysis setting of Figure 6.1, where we recall the stylized
application of estimating mean functionals P ϕ for ϕ ∈ {ϕt }t∈T . To motivate a more careful ap-
proach, we consider a simple example to show the challenges that may arise even with only a single
“round” of interactive data analysis. Naively answering queries accurately—using the mechanism
Pn ϕ that simply computes the sample average—can easily lead to problems:

Example 6.3.4 (A stylized correlation analysis): Consider the following stylized genetics experiment. We observe vectors X ∈ {−1, 1}^k, where X_j = 1 if gene j is expressed and −1 otherwise. We also observe phenotypes Y ∈ {−1, 1}, where Y = 1 indicates appearance of the phenotype. In our setting, we will assume that the vectors X are uniform on {−1, 1}^k and independent of Y, but an experimentalist friend of ours wishes to know if there exists a vector v with ∥v∥_2 = 1 such that the correlation between v^T X and Y is high, meaning that v^T X is associated with Y. In our notation here, we have index set {v ∈ R^k | ∥v∥_2 = 1}, and by Example 4.1.6, Hoeffding’s lemma, and the independence of the coordinates of X, we have that v^T X Y is ∥v∥_2²/4 = 1/4-sub-Gaussian. Now, we recall the fact that if Z_j, j = 1, . . . , k, are σ²-sub-Gaussian, then for any p ≥ 1, we have
\[
E[\max_j |Z_j|^p] \le (C p \sigma^2 \log k)^{p/2}
\]


for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarithmically. Indeed, by Theorem 4.1.11 and Hölder’s inequality, we have for any q ≥ 1 that
\[
E[\max_j |Z_j|^p] \le E\left[\sum_j |Z_j|^{pq}\right]^{1/q} \le k^{1/q} (C p q \sigma^2)^{p/2},
\]
and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v_1, . . . , v_k, v_{k+1}, we have
\[
E[\max_j (v_j^T (P_n Y X))^2] \le O(1)\, \frac{\log k}{n}.
\]
If instead we allow a single interaction, the problem is different. We issue queries associated with v = e_1, . . . , e_k, the k standard basis vectors; then we simply set V_{k+1} = P_n Y X / ∥P_n Y X∥_2. Then evidently
\[
E[(V_{k+1}^T (P_n Y X))^2] = E[\|P_n Y X\|_2^2] = \frac{k}{n},
\]
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed to interact with the dataset, he or she may be able to discover very large correlations that are certainly false in the population, which in this case has P XY = 0. ♢
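
A short simulation makes the gap in Example 6.3.4 tangible; this is a sketch in Python with NumPy, and the sample size and dimension are chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 100, 1000  # sample size and number of genes (illustrative)

    # X uniform on {-1,1}^k and Y uniform on {-1,1}, independent of X
    X = rng.choice([-1.0, 1.0], size=(n, k))
    Y = rng.choice([-1.0, 1.0], size=n)

    g = (Y[:, None] * X).mean(axis=0)  # P_n YX, i.e. the k basis-vector queries

    # Non-interactive: a fixed unit vector such as e_1 sees ~ 1/n
    print("fixed query:", g[0] ** 2)

    # One round of interaction: aim v at the empirical correlations, ~ k/n
    v = g / np.linalg.norm(g)
    print("adaptive query:", (v @ g) ** 2)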

Example 6.3.4 shows that, unless we are a little careful, substantial issues may arise in interactive data analysis scenarios. Considering our goal more broadly, which is to provide accurate approximations to Pϕ for queries ϕ : X → [−1, 1] chosen adaptively, for any population distribution P, it is possible to construct quite perverse situations, in which, if we compute sample expectations P_n ϕ exactly, one round of interaction is sufficient to find a query ϕ for which P_n ϕ − P ϕ ≥ 1.
Example 6.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw a sample X_1^n of size n on a sample space X = [m] with X_i drawn i.i.d. Uniform([m]), where m ≥ 2n. Let Φ be the collection of all functions ϕ : [m] → [−1, 1], so that P(|P_n ϕ − P ϕ| ≥ t) ≤ exp(−nt²/2) for any fixed ϕ. Suppose that in the interactive scheme in Fig. 6.1, we simply release answers A = P_n ϕ. Consider the following query:
\[
\phi(x) = n^{-x} \quad \text{for } x = 1, 2, \ldots, m.
\]
Then by inspection, we see that
\[
P_n\phi = \frac{1}{n}\sum_{j=1}^m n^{-j}\, \mathrm{card}(\{i \mid X_i = j\})
= \frac{1}{n}\left[\frac{1}{n}\, \mathrm{card}(\{i \mid X_i = 1\}) + \frac{1}{n^2}\, \mathrm{card}(\{i \mid X_i = 2\}) + \cdots + \frac{1}{n^m}\, \mathrm{card}(\{i \mid X_i = m\})\right].
\]
It is clear that given P_n ϕ, we can reconstruct the sample counts exactly. Then if we define a second query ϕ_2(x) = 1 for x ∈ X_1^n and ϕ_2(x) = −1 for x ∉ X_1^n, we see that P ϕ_2 ≤ 2n/m − 1, while P_n ϕ_2 = 1. The gap is thus
\[
E[P_n\phi_2 - P\phi_2] \ge 2 - \frac{2n}{m} \ge 1,
\]
which is essentially as bad as possible. ♢
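
The reconstruction in Example 6.3.5 is easy to carry out; below is a sketch in Python that uses exact Fraction arithmetic (to avoid floating-point loss, which a real attack would need to handle) with small, illustrative n and m.

    from collections import Counter
    from fractions import Fraction

    n, m = 5, 12              # sample size and domain size (illustrative; m >= 2n)
    sample = [3, 3, 7, 1, 7]

    # The released answer A = P_n phi for the query phi(x) = n^{-x}
    A = sum(Fraction(1, n ** x) for x in sample) / n

    # Decode: n^{m+1} * A = sum_j c_j n^{m-j} is an integer whose base-n
    # "digits" are the counts c_j (unambiguous here since every c_j < n)
    S = int(A * n ** (m + 1))
    counts = {}
    for j in range(1, m + 1):
        counts[j], S = divmod(S, n ** (m - j))

    print({j: c for j, c in counts.items() if c > 0})  # recovered counts
    print(Counter(sample))                             # ground truth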


More generally, when one performs an interactive data analysis (e.g. as in Fig. 6.1), adapting
hypotheses while interacting with a dataset, it is not a question of statistical significance or mul-
tiplicity control for the analysis one does, but for all the possible analyses one might have done
otherwise. Given the branching paths one might take in an analysis, it is clear that we require
some care.
With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 6.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X^n → A, taking values in our index set A, where A(X_1^n) ∈ A is a random variable that depends on the sample X_1^n. For simplicity in derivation, we abuse notation in this section, and for random variables X and Y with distributions P and Q respectively, we denote
\[
D_{\mathrm{kl}}(X\|Y) := D_{\mathrm{kl}}(P\|Q).
\]
We then ask for a type of leave-one-out stability for the algorithms A, where A is insensitive to changes of a single example (on average).

Definition 6.1. Let ε ≥ 0. A randomized algorithm A : X^n → A is ε-KL-stable if for each i ∈ {1, . . . , n} there is a randomized A_i : X^{n−1} → A such that for every sample x_1^n ∈ X^n,
\[
\frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}\big(A(x_1^n)\,\|\,A_i(x_{\setminus i})\big) \le \varepsilon.
\]

Examples may be useful to understand Definition 6.1.


Example 6.3.6 (KL-stability in mean estimation: Gaussian noise addition): Suppose we wish to estimate a mean, and that the x_i ∈ [−1, 1] are all real-valued. Then a natural statistic is to simply compute A(x_1^n) = (1/n) Σ_{i=1}^n x_i. In this case, without randomization, we will have infinite KL-divergence between A(x_1^n) and A_i(x_{\setminus i}). If instead we set A(x_1^n) = (1/n) Σ_{i=1}^n x_i + Z for Z ∼ N(0, σ²), and similarly A_i = (1/n) Σ_{j≠i} x_j + Z, then we have (recall Example 2.1.7)
\[
\frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}\big(A(x_1^n)\,\|\,A_i(x_{\setminus i})\big) = \frac{1}{2n\sigma^2}\sum_{i=1}^n \frac{x_i^2}{n^2} \le \frac{1}{2\sigma^2 n^2},
\]
so that the sample mean of a bounded random variable perturbed with Gaussian noise is ε = 1/(2σ²n²)-KL-stable. ♢


We can consider other types of noise addition as well.

Example 6.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions of Example 6.3.6 hold, but suppose instead of Gaussian noise we add scaled Laplace noise, that is, A(x_1^n) = (1/n) Σ_{i=1}^n x_i + Z for Z with density p(z) = (1/(2σ)) exp(−|z|/σ), where σ > 0. Then using that if L_{μ,σ} denotes the Laplace distribution with shape σ and mean μ, with density p(z) = (1/(2σ)) exp(−|z − μ|/σ), we have
\[
D_{\mathrm{kl}}(L_{\mu_0,\sigma}\|L_{\mu_1,\sigma}) = \frac{1}{\sigma^2}\int_0^{|\mu_1 - \mu_0|} \exp(-z/\sigma)(|\mu_1 - \mu_0| - z)\,dz
= \exp\left(-\frac{|\mu_1 - \mu_0|}{\sigma}\right) - 1 + \frac{|\mu_1 - \mu_0|}{\sigma} \le \frac{|\mu_1 - \mu_0|^2}{2\sigma^2},
\]
we see that in this case the sample mean of a bounded random variable perturbed with Laplace noise is ε = 1/(2σ²n²)-KL-stable, where σ is the shape parameter. ♢
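
Both stability computations are easy to verify numerically. The following sketch, in Python with NumPy and SciPy, computes the average leave-one-out KL-divergence for the Gaussian mechanism in closed form and for the Laplace mechanism by quadrature, comparing both to the bound ε = 1/(2σ²n²); the sample and the value of σ are illustrative.

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(0)
    n, sigma = 50, 0.5
    x = rng.uniform(-1, 1, size=n)

    mu = x.mean()
    mu_loo = (x.sum() - x) / n  # A_i = (1/n) sum_{j != i} x_j

    # Gaussian: KL(N(mu, s^2) || N(mu_i, s^2)) = (mu - mu_i)^2 / (2 s^2)
    kl_gauss = np.mean((mu - mu_loo) ** 2 / (2 * sigma ** 2))

    # Laplace: KL(L_{mu,s} || L_{mu_i,s}) computed by numerical integration
    def kl_laplace(m0, m1, s):
        p0 = lambda z: np.exp(-abs(z - m0) / s) / (2 * s)
        integrand = lambda z: p0(z) * (abs(z - m1) - abs(z - m0)) / s
        return quad(integrand, -np.inf, np.inf)[0]

    kl_lap = np.mean([kl_laplace(mu, mi, sigma) for mi in mu_loo])

    print(kl_gauss, kl_lap, 1 / (2 * sigma ** 2 * n ** 2))  # both below the bound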

The two key facts are that KL-stable algorithms compose adaptively and that they bound
mutual information in independent samples.

Lemma 6.3.8. Let A : X^n → A_0 and A′ : A_0 × X^n → A_1 be ε- and ε′-KL-stable algorithms, respectively. Then the (randomized) composition A′ ∘ A(x_1^n) = A′(A(x_1^n), x_1^n) is ε + ε′-KL-stable. Moreover, the pair (A′ ∘ A(x_1^n), A(x_1^n)) is ε + ε′-KL-stable.

Proof  Let A_i and A′_i be the promised sub-algorithms in Definition 6.1. We apply the data processing inequality, which implies for each i that
\[
D_{\mathrm{kl}}\big(A'(A(x_1^n), x_1^n)\,\|\,A'_i(A_i(x_{\setminus i}), x_{\setminus i})\big) \le D_{\mathrm{kl}}\big(A'(A(x_1^n), x_1^n), A(x_1^n)\,\|\,A'_i(A_i(x_{\setminus i}), x_{\setminus i}), A_i(x_{\setminus i})\big).
\]
We require a bit of notational trickery now. Fixing i, let P_{A,A′} be the joint distribution of A′(A(x_1^n), x_1^n) and A(x_1^n), and Q_{A,A′} the joint distribution of A′_i(A_i(x_{\setminus i}), x_{\setminus i}) and A_i(x_{\setminus i}), so that they are both distributions over A_1 × A_0. Let P_{A′|t} be the distribution of A′(t, x_1^n) and similarly Q_{A′|t} the distribution of A′_i(t, x_{\setminus i}). Note that A′ and A′_i both “observe” x, so that using the chain rule (2.1.6) for KL-divergences, we have
\[
\begin{aligned}
D_{\mathrm{kl}}\big(A' \circ A, A \,\|\, A'_i \circ A_i, A_i\big) &= D_{\mathrm{kl}}\big(P_{A,A'}\|Q_{A,A'}\big) \\
&= D_{\mathrm{kl}}(P_A\|Q_A) + \int D_{\mathrm{kl}}\big(P_{A'|t}\|Q_{A'|t}\big)\,dP_A(t) \\
&= D_{\mathrm{kl}}(A\|A_i) + E_A\big[D_{\mathrm{kl}}\big(A'(A, x_1^n)\,\|\,A'_i(A, x_{\setminus i})\big)\big].
\end{aligned}
\]
Summing this from i = 1 to n yields
\[
\frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}\big(A' \circ A\,\|\,A'_i \circ A_i\big) \le \frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}(A\|A_i) + E_A\left[\frac{1}{n}\sum_{i=1}^n D_{\mathrm{kl}}\big(A'(A, x_1^n)\,\|\,A'_i(A, x_{\setminus i})\big)\right] \le \varepsilon + \varepsilon',
\]
as desired.

The second key result is that KL-stable algorithms also bound the mutual information of a
random function.


Lemma 6.3.9. Let the X_i be independent. Then for any random variable A,
\[
I(A; X_1^n) \le \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}) = \sum_{i=1}^n \int D_{\mathrm{kl}}\big(A(x_1^n)\,\|\,A_i(x_{\setminus i})\big)\, dP(x_1^n),
\]
where A_i(x_{\setminus i}) = A(x_1^{i−1}, X_i, x_{i+1}^n) is the random realization of A conditional on X_{\setminus i} = x_{\setminus i}.

Proof  Without loss of generality, we assume A and X are both discrete. In this case, we have
\[
I(A; X_1^n) = \sum_{i=1}^n I(A; X_i \mid X_1^{i-1}) = \sum_{i=1}^n \big[H(X_i \mid X_1^{i-1}) - H(X_i \mid A, X_1^{i-1})\big].
\]
Now, because the X_i follow a product distribution, H(X_i | X_1^{i−1}) = H(X_i), while H(X_i | A, X_1^{i−1}) ≥ H(X_i | A, X_{\setminus i}) because conditioning reduces entropy. Consequently, we have
\[
I(A; X_1^n) \le \sum_{i=1}^n \big[H(X_i) - H(X_i \mid A, X_{\setminus i})\big] = \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}).
\]

To see the final equality, note that
\[
I(A; X_i \mid X_{\setminus i}) = \int_{\mathcal{X}^{n-1}} I(A; X_i \mid X_{\setminus i} = x_{\setminus i})\, dP(x_{\setminus i})
= \int_{\mathcal{X}^{n-1}} \int_{\mathcal{X}} D_{\mathrm{kl}}\big(A(x_1^n)\,\|\,A(x_{1:i-1}, X_i, x_{i+1:n})\big)\, dP(x_i)\, dP(x_{\setminus i})
\]
by definition of mutual information as I(X; Y) = E_X[D_kl(P_{Y|X}‖P_Y)].

Combining Lemmas 6.3.8 and 6.3.9, we see (nearly) immediately that KL stability implies
a mutual information bound, and consequently even interactive KL-stable algorithms maintain
bounds on mutual information.

Proposition 6.3.10. Let A_1, . . . , A_k be ε_i-KL-stable procedures, respectively, composed in an arbitrary sequence, and let the X_i be independent. Then
\[
\frac{1}{n} I(A_1, \ldots, A_k; X_1^n) \le \sum_{i=1}^k \varepsilon_i.
\]

Proof  Applying Lemma 6.3.9,
\[
I(A_1^k; X_1^n) \le \sum_{i=1}^n I(A_1^k; X_i \mid X_{\setminus i}) = \sum_{j=1}^k \sum_{i=1}^n I(A_j; X_i \mid X_{\setminus i}, A_1^{j-1}).
\]
Fix an index j, and for shorthand let A = A_j and A′ = (A_1, . . . , A_{j−1}) be the first j − 1 procedures. Then expanding the final mutual information term and letting ν denote the distribution of A′, we have
\[
I(A; X_i \mid X_{\setminus i}, A') = \int D_{\mathrm{kl}}\big(A(a', x_1^n)\,\|\,A(a', x_{\setminus i})\big)\, dP(x_i \mid A' = a', x_{\setminus i})\, dP^{n-1}(x_{\setminus i})\, d\nu(a' \mid x_{\setminus i})
\]




where A(a′, x_1^n) is the (random) procedure A on inputs x_1^n and a′, while A(a′, x_{\setminus i}) denotes the (random) procedure A on input a′, x_{\setminus i}, X_i, and where the ith example X_i follows its distribution conditional on A′ = a′ and X_{\setminus i} = x_{\setminus i}, as in Lemma 6.3.9. We then recognize that for each i, we have
\[
\int D_{\mathrm{kl}}\big(A(a', x_1^n)\,\|\,A(a', x_{\setminus i})\big)\, dP(x_i \mid a', x_{\setminus i}) \le \int D_{\mathrm{kl}}\big(A(a', x_1^n)\,\|\,\widetilde{A}(a', x_{\setminus i})\big)\, dP(x_i \mid a', x_{\setminus i})
\]
for any randomized function Ã, as the marginal A in the lemma minimizes the average KL-divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as in Lemma 6.3.8.

6.3.4 Error bounds for a simple noise addition scheme


Based on Proposition 6.3.10, to build an appropriately well-generalizing procedure we must build a mechanism for the interaction in Fig. 6.1 that maintains KL-stability. Using Example 6.3.6, this is not challenging for the class of bounded queries. Let Φ = {ϕ_t}_{t∈T}, where ϕ_t : X → [−1, 1], be the collection of statistical queries taking values in [−1, 1]. Then based on Proposition 6.3.10 and Example 6.3.6, the following procedure is stable.

Input: Sample X_1^n ∈ X^n drawn i.i.d. P, collection {ϕ_t}_{t∈T} of possible queries ϕ_t : X → [−1, 1]
Repeat: for k = 1, 2, . . .

i. Analyst chooses index T_k ∈ T and query ϕ := ϕ_{T_k}

ii. Mechanism draws independent Z_k ∼ N(0, σ²) and responds with answer
\[
A_k := P_n \phi + Z_k = \frac{1}{n}\sum_{i=1}^n \phi(X_i) + Z_k.
\]

Figure 6.2: Sequential Gaussian noise mechanism.
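
A minimal sketch of this mechanism in Python with NumPy follows; the class name, the choice of σ, and the example queries are illustrative assumptions rather than anything fixed by the text.

    import numpy as np

    class GaussianSQMechanism:
        """Answer queries phi: X -> [-1, 1] with A = P_n phi + N(0, sigma^2)."""

        def __init__(self, sample, sigma, seed=0):
            self.sample = np.asarray(sample)
            self.sigma = sigma
            self.rng = np.random.default_rng(seed)

        def answer(self, phi):
            # Each answer is 1/(2 sigma^2 n^2)-KL-stable (Example 6.3.6), so by
            # Proposition 6.3.10 even adaptively chosen queries keep the mutual
            # information between sample and query indices under control.
            pn_phi = np.mean([phi(x) for x in self.sample])
            return pn_phi + self.rng.normal(scale=self.sigma)

    # Illustrative use with adaptively chosen queries
    data = np.random.default_rng(1).uniform(-1, 1, size=500)
    mech = GaussianSQMechanism(data, sigma=0.05)
    a1 = mech.answer(lambda x: x)                 # query the mean
    a2 = mech.answer(lambda x: x ** 2 - a1 ** 2)  # second query depends on a1
    print(a1, a2)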

This procedure is evidently KL-stable, and based on Example 6.3.6 and Proposition 6.3.10, we have that
\[
\frac{1}{n} I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{k}{2\sigma^2 n^2}
\]
so long as the indices T_i ∈ T are chosen only as functions of the answers P_n ϕ_{T_j} + Z_j for j < i, as the classical information processing inequality implies that
\[
\frac{1}{n} I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{1}{n} I(X_1^n; A_1, \ldots, A_k)
\]
because we have the Markov chain X_1^n → A_1 → T_2 and so on for the remaining indices. With this, we obtain the following theorem.


Theorem 6.3.11. Let the indices T_i, i = 1, . . . , k + 1, be chosen in an arbitrary way using the procedure of Figure 6.2, and let σ² > 0. Then
\[
E\Big[\max_{j \le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{2ek}{\sigma^2 n^2} + \frac{10}{4n} + 4\sigma^2(\log k + 1).
\]
By inspection, we can optimize over σ² by setting σ² = √(k/(log k + 1))/n, which yields the upper bound
\[
E\Big[\max_{j \le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{10}{4n} + 10\,\frac{\sqrt{k(1 + \log k)}}{n}.
\]
Comparing to Example 6.3.4, we see a substantial improvement. While we do not achieve accuracy scaling with log k, as we would if the queried functionals ϕ_t were completely independent of the sample, we see that we achieve mean-squared error of order
\[
\frac{\sqrt{k \log k}}{n}
\]
for k adaptively chosen queries.
Proof  To prove the result, we use a technique sometimes called the monitor technique. Roughly, the idea is that we can choose the index T_{k+1} in any way we desire, as long as it is a function of the answers A_1, . . . , A_k and any other constants independent of the data. Thus, we may choose
\[
T_{k+1} := T_{k^\star} \quad \text{where} \quad k^\star = \mathop{\mathrm{argmax}}_{j \le k}\{|A_j - P\phi_{T_j}|\},
\]
as this is a (downstream) function of the k different ε = 1/(2σ²n²)-KL-stable queries T_1, . . . , T_k. As a consequence, we have from Corollary 6.3.3 (and the fact that the queries ϕ are 1-sub-Gaussian) that for T = T_{k+1},
\[
E[(P_n\phi_T - P\phi_T)^2] \le \frac{2e}{n} I(X_1^n; T_{k+1}) + \frac{5}{4n} \le 2ek\varepsilon + \frac{5}{4n} = \frac{ek}{\sigma^2 n^2} + \frac{5}{4n}.
\]
Now, we simply consider the independent noise addition, noting that (a + b)² ≤ 2a² + 2b² for any a, b ∈ R, so that
\[
E\Big[\max_{j \le k}(A_j - P\phi_{T_j})^2\Big] \le 2E[(P_n\phi_T - P\phi_T)^2] + 2E\Big[\max_{j \le k} Z_j^2\Big] \le \frac{2ek}{\sigma^2 n^2} + \frac{10}{4n} + 4\sigma^2(\log k + 1), \tag{6.3.3}
\]
where inequality (6.3.3) is the desired result and follows by the following lemma.
Lemma 6.3.12. Let W_j, j = 1, . . . , k, be independent N(0, 1). Then E[max_j W_j²] ≤ 2(log k + 1).

Proof  We assume that k ≥ 3, as the result is trivial otherwise. Using the tail bound for Gaussians (Mills’s ratio for Gaussians, which is tighter than the standard sub-Gaussian bound) that P(W ≥ t) ≤ (1/(√(2π) t)) e^{−t²/2} for t ≥ 0, and that E[Z] = ∫_0^∞ P(Z ≥ t) dt for a nonnegative random variable Z, we obtain that for any t_0,
\[
E[\max_j W_j^2] = \int_0^\infty P\Big(\max_j W_j^2 \ge t\Big) dt \le t_0 + \int_{t_0}^\infty P\Big(\max_j W_j^2 \ge t\Big) dt
\le t_0 + 2k \int_{t_0}^\infty P(W_1 \ge \sqrt{t})\, dt \le t_0 + \frac{2k}{\sqrt{2\pi}} \int_{t_0}^\infty e^{-t/2}\, dt = t_0 + \frac{4k}{\sqrt{2\pi}} e^{-t_0/2}.
\]
Setting t_0 = 2 log(4k/√(2π)) gives E[max_j W_j²] ≤ 2 log k + 2 log(4/√(2π)) + 1 ≤ 2(log k + 1).
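
The constant in Lemma 6.3.12 is easy to sanity-check by simulation; here is a sketch in Python with NumPy, where the number of Monte Carlo trials is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    for k in [3, 10, 100, 1000]:
        maxima = (rng.normal(size=(20000, k)) ** 2).max(axis=1)
        print(k, maxima.mean(), 2 * (np.log(k) + 1))  # empirical mean vs. the bound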

6.4 Bibliography and further reading


PAC-Bayes techniques originated with work of David McAllester [144, 145, 146], and we remark on his excellently readable tutorial [147]. The particular approaches we take to our proofs in Section 6.2 follow Catoni [48] and McAllester [146]. The PAC-Bayesian bounds we present, that simultaneously for any distribution π on F, if F ∼ π then
\[
E[(P_n F - P F)^2 \mid X_1^n] \lesssim \frac{1}{n}\left(D_{\mathrm{kl}}(\pi\|\pi_0) + \log\frac{1}{\delta}\right)
\]
with probability at least 1 − δ, suggest that we can optimize them by choosing π carefully. For example, in the context of learning a statistical model parameterized by θ ∈ Θ with losses ℓ(θ; x, y), it is natural to attempt to find π minimizing
\[
E_\pi[P_n \ell(\theta; X, Y) \mid P_n] + C\sqrt{\frac{1}{n} D_{\mathrm{kl}}(\pi\|\pi_0)}
\]
in π, where the expectation is taken over θ ∼ π. If this quantity has optimal value ϵ⋆_n, then one is