Lecture Notes on Statistics and Information Theory
John Duchi
4 Concentration Inequalities
4.1 Basic tail inequalities
4.1.1 Sub-Gaussian random variables
4.1.2 Sub-exponential random variables
4.1.3 Orlicz norms
4.1.4 First applications of concentration: random projections
4.1.5 A second application of concentration: codebook generation
4.2 Martingale methods
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities
4.2.2 Examples and bounded differences
4.3 Matrix concentration
4.4 Technical proofs
4.4.1 Proof of Theorem 4.1.11
4.4.2 Proof of Theorem 4.1.15
4.4.3 Proof of Theorem 5.1.6
4.4.4 Proof of Proposition 4.3.2
4.5 Bibliography
4.6 Exercises
9 Minimax lower bounds: the Le Cam, Fano, and Assouad methods
9.1 Basic framework and minimax risk
9.2 Preliminaries on methods for lower bounds
9.2.1 From estimation to testing
9.2.2 Inequalities between divergences and product distributions
9.2.3 Metric entropy and packing numbers
9.3 Le Cam’s method
9.4 Fano’s method
9.4.1 The classical (local) Fano method
9.4.2 A distance-based Fano method
9.5 Assouad’s method
9.5.1 Well-separated problems
9.5.2 From estimation to multiple binary tests
9.5.3 Example applications of Assouad’s method
9.6 Deferred proofs
9.6.1 Proof of Proposition 9.4.6
9.6.2 Proof of Corollary 9.4.7
9.6.3 Proof of Lemma 9.5.2
9.7 Bibliography
9.8 Exercises
V Appendices
Chapter 1
This book explores some of the (many) connections relating information theory, statistics, computa-
tion, and learning. Signal processing, machine learning, and statistics all revolve around extracting
useful information from signals and data. In signal processing and information theory, a central
question is how to best design signals—and the channels over which they are transmitted—to max-
imally communicate and store information, and to allow the most effective decoding. In machine
learning and statistics, by contrast, it is often the case that nature provides a fixed data distri-
bution, and it is the learner’s or statistician’s goal to recover information about this (unknown)
distribution. Our goal will be to show how information-theoretic perspectives can provide clean
answers about, and techniques to perform, this recovery.
The discovery of fundamental limits forms a central aspect of information theory: the develop-
ment of results that demonstrate that certain procedures are optimal. Thus, information theoretic
tools allow a characterization of the attainable results in a variety of communication and statis-
tical settings. As we explore in the coming chapters in the context of statistical, inferential, and
machine learning tasks, this allows us to develop procedures whose optimality we can certify—no
better procedure is possible. Such results are useful for a myriad of reasons: we would like to avoid
making bad decisions or false inferences; we may realize a task is impossible; and we can explicitly
calculate the amount of data necessary for solving different statistical problems.
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have
\[ \text{Source} \longrightarrow \text{Compressor (encoder)} \longrightarrow \text{Decompressor (decoder)} \longrightarrow \text{Receiver}. \]
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
uses the fewest number of bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This fewest number of bits is then the information content of the source (signal). $\diamond$
Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. The graphical representation becomes
\[ \text{Source} \longrightarrow \text{Compressor} \longrightarrow \text{Channel} \longrightarrow \text{Decompressor} \longrightarrow \text{Receiver}. \]
Here we investigate the maximum number of bits that may be sent per each channel use in
the sense that the receiver can reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy—in the form of additional bits—that must be sent to
allow such reconstruction. $\diamond$
In the statistical analogy with source coding, we observe a sample $X_1, \ldots, X_n$ drawn from a distribution $P$ and compress it into an estimate $\hat{P}$:
\[ \text{Source } (P) \xrightarrow{\;X_1,\ldots,X_n\;} \text{Compressor} \xrightarrow{\;\hat{P}\;} \text{Decompressor} \longrightarrow \text{Receiver}. \]
Here, we estimate $\hat{P}$—an empirical version of the distribution $P$ that is easier to describe than
the original signal $X_1, \ldots, X_n$—with the hope that we learn information about the generating
distribution $P$, or at least describe it efficiently.
In our analogy with channel coding we can connect to estimation and inference. Consider a
statistical problem in which there exists some unknown function f on a space X that we wish
to estimate, and we are able to observe a noisy version of f (Xi ) for a series of Xi drawn from
a distribution P . Recalling the graphical description of Example 1.1.2, we now have a channel
P (Y | f (X)) that gives us noisy observations of f (X) for each Xi , but we generally no longer
choose the encoder/compressor. That is, we have
\[ \text{Source } (P) \xrightarrow{\;X_1,\ldots,X_n\;} \text{Compressor} \xrightarrow{\;f(X_1),\ldots,f(X_n)\;} \text{Channel } P(Y \mid f(X)) \xrightarrow{\;Y_1,\ldots,Y_n\;} \text{Decompressor}. \]
Typically we have no choice in the design of the compressor $f$ that transforms the original signal $X_1, \ldots, X_n$, which
makes it somewhat different from traditional ideas in information theory. In some cases that we
explore later—such as experimental design, randomized controlled trials, reinforcement learning
and bandits (and associated exploration/exploitation tradeoffs)—we are also able to influence the
compression part of the above scheme.
Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data $X_i$ belong to $\mathbb{R}^d$, and the compression function is $f(x) = \theta^\top x$
for some vector $\theta \in \mathbb{R}^d$. Then the channel is often of the form
\[ Y_i = \underbrace{\theta^\top X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}}, \]
where $\varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, \sigma^2)$ are independent mean-zero normal perturbations. Given a sequence of
pairs $(X_i, Y_i)$, we wish to recover the true $\theta$ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental
design, we may choose the sequence Xi so as to better explore properties of θ. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting $\|X_i\| \to \infty$—
choosing very “large” vectors $x_i$—then the signal $\theta^\top X_i$ should swamp any noise and make
estimation easier. $\diamond$
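To make the “infinite power” intuition concrete, the following small simulation—our own sketch, not part of the original notes, assuming Python with numpy—scales the design vectors and watches the least-squares error of $\theta$ shrink:

```python
import numpy as np

# Sketch: scaling up the design X_i (more "power") swamps the fixed noise,
# so the least-squares estimate of theta improves.
rng = np.random.default_rng(0)
d, n, sigma = 5, 50, 1.0
theta = rng.normal(size=d)

for scale in [1.0, 10.0, 100.0]:
    X = scale * rng.normal(size=(n, d))          # design vectors of norm ~ scale
    Y = X @ theta + sigma * rng.normal(size=n)   # channel Y_i = theta^T X_i + eps_i
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(scale, np.linalg.norm(theta_hat - theta))
# The estimation error decays roughly like sigma / scale.
```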
In this context, we develop PAC-Bayesian bounds, and we also use the same framework to present tools to
control generalization and convergence in interactive data analyses. These types of analyses reflect
modern statistics, where one performs some type of data exploration before committing to a fuller
analysis, but which breaks classical statistical approaches, because the analysis now depends on
the sample. We provide a treatment of more advanced ideas in Chapter 7, where we develop more
sophisticated concentration results, such as on random matrices, using core ideas from information
theory, which allow us to connect divergence measures to different random processes. Finally, we
provide a chapter (Chapter 8) on disclosure limitation and privacy techniques, all of which repose
on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower
bounds on the possible rates of convergence for various estimation, learning, and other statistical
problems. Chapter 9 kicks things off by developing the three major methods for lower bounds:
the Assouad, Fano, and Le Cam methods. This chapter shows the basic techniques from which all
the other lower bound ideas follow. At a high level, we might consider it, along with Part I, as
exhibiting the entire object of study of this book: how do distributions get close to one another, and
how can we leverage that closeness? We give a brief treatment of some lower bounding techniques
beyond these approaches in Chapter 10, including applications to certain nonparametric problems,
as well as a few results that move beyond the typical lower bounds, which apply in expectation,
to some that mimic “strong converses” in information theory, meaning that with exceedingly high
probability, one cannot hope to achieve anything better than average case error guarantees.
In modern statistical learning problems, one frequently has concerns beyond just statistical risk,
such as communication or computational cost, or the privacy of study participants. Accordingly,
we develop some of the recent techniques for such problems in Chapter 11 on problems where we
wish to obtain optimality guarantees simultaneously along many dimensions, connecting to com-
munication complexity ideas from information theory. Chapter 12 provides a bit of a throwback to
estimation with squared error—the most common error metric—introducing the classical statistical
tools, but showing a few of the more modern applications of these ideas, which reappear with
some frequency. Finally, we conclude the discussion of fundamental limits by looking at testing
problems and functional estimation, where one wishes to only estimate a single parameter of a
larger model (Chapter 13). While estimating a single scalar might seem, a priori, to be simpler
than other problems, adequately addressing its complexity requires a fairly nuanced treatment and
the introduction of careful information-theoretic tools.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of simply
giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (which we cover in the overview of information theory in Section 2.4.1), but we also provide an interpretation of entropy and information as measures
information theory), but we also provide an interpretation of entropy and information as measures
of uncertainty in statistical experiments and statistical learning, which is a perspective typically
missing from information-theoretic treatments of entropy (Chapter 14). Our treatment shows a
deep connection between entropy and loss functions used for prediction, where a particular duality
allows moving back and forth between them.
We connect these ideas to the problem of calibration in Chapter 15, where we ask that a
prediction model be valid in the sense that, e.g., on 75% of the days for which the model predicts a
75% chance of rain, it indeed rains. We are also able to use these information-theoretic notions of risk, entropy, and losses
to connect to problems in optimization and machine learning. In particular, Chapter 16 explores the
ways that, if instead of fitting a model to some “true” loss we use an easier-to-optimize surrogate,
we essentially lose nothing. This allows us to delineate when (at least in asymptotic senses) it
is possible to computationally efficiently learn good predictors and design good experiments in
statistical machine learning problems. Because of the connections with optimization and convex
duality, these chapters repose on a nontrivial foundation of convex analysis; we include Appendices
(Appendix B and C) that provide a fairly comprehensive review of the results we require. For
readers unfamiliar with convex optimization and analysis, I will be the first to admit that these
chapters may be tough going—accordingly, we attempt to delineate the big-picture ideas from the
nitty-gritty technical conditions necessary for the most general results.
Part IV finishes the book with a treatment of stochastic optimization, online game playing,
and minimax problems. Our approach in Chapter 17 takes a modern perspective on stochastic
optimization as minimizing random models of functions, and it includes the “book” proofs of
convergence of the workhorses of modern machine learning optimization. It also leverages the
earlier results on fundamental limits to develop optimality theory for convex optimization in the
same framework. Chapter 18 explores online decision-making problems and, more broadly, problems
that require exploration and exploitation. This includes bandit problems and some basic questions
in causal estimation, where information-theoretic tools allow a clean treatment. The concluding
Chapter 19 revisits Chapter 14 on loss functions and predictions, but considers it more in the
context of particular games between nature and a statistician/learner. Once again leveraging the
perspective on entropy and loss functions we have developed, we are able to provide a generalization
of the celebrated redundancy/capacity theorem from information theory, but recast as a game of
loss minimization against a nature.
Chapter 2
In this introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.
2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.
Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of $X$ (or of $P$) is defined as
\[ H(X) := -\sum_x p(x)\log p(x). \]
Because $p(x) \le 1$ for all $x$, this quantity is always nonnegative. We will show later that if $\mathcal{X}$
is finite, the maximum entropy distribution on $\mathcal{X}$ is the uniform distribution, setting $p(x) = 1/|\mathcal{X}|$
for all $x$, which has entropy $\log(|\mathcal{X}|)$.
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [167]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to $P$, with a $k$-ary string (i.e., each entry of the string takes on one of $k$ values), then the minimal
expected length of the encoding is given by
\[ H(X) = -\sum_x p(x)\log_k p(x). \]
Moreover, this is achievable (to within a length of at most 1 symbol) by using Huffman codes (among
many other types of codes). As an example of this interpretation, we may consider encoding a
random variable $X$ with the equiprobable distribution on $m$ items, which has $H(X) = \log(m)$.
In base-2, this makes sense: we simply assign an integer to each item and encode each integer
with the natural binary encoding of length $\lceil \log_2 m \rceil$.
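As an illustration of this coding interpretation—a sketch of ours, not from the original notes, assuming Python—one can build a binary Huffman code and compare its expected length with the entropy in bits:

```python
import heapq
import math

def huffman_lengths(pmf):
    """Return binary Huffman codeword lengths for a pmf given as a dict."""
    # Heap entries are (probability, tiebreak id, symbols in the subtree).
    heap = [(p, i, [x]) for i, (x, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(pmf, 0)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, i, s2 = heapq.heappop(heap)
        for x in s1 + s2:  # each merge adds one bit to every member's codeword
            lengths[x] += 1
        heapq.heappush(heap, (p1 + p2, i, s1 + s2))
    return lengths

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(p * math.log2(p) for p in pmf.values())  # entropy in bits (k = 2)
lengths = huffman_lengths(pmf)
avg_len = sum(pmf[x] * lengths[x] for x in pmf)
print(H, avg_len)  # for this dyadic pmf both equal 1.75 bits
```

For dyadic probabilities the two quantities coincide exactly; in general the expected length sits within one symbol of the entropy.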
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
\[ H(X \mid Y = y) = -\sum_x p(x\mid y)\log p(x\mid y) \quad\text{and}\quad H(X\mid Y) = \sum_y p(y)\, H(X\mid Y = y), \]
the latter being the average entropy remaining in $X$ after we observe $Y$.
Example 2.1.2 (Bernoulli random variables): Let $h_2(p) = -p\log p - (1-p)\log(1-p)$ denote
the binary entropy, which is the entropy of a Bernoulli$(p)$ random variable. $\diamond$
Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by
\[ p(k) = \frac{A}{k\log^2 k}, \quad\text{where } A^{-1} = \sum_{k=2}^{\infty} \frac{1}{k\log^2 k} < \infty, \]
the last sum finite because $\int_2^\infty \frac{1}{x\log^\alpha x}\, dx < \infty$ if and only if $\alpha > 1$: for $\alpha = 1$, we have $\int_e^x \frac{dt}{t\log t} = \log\log x$, while for $\alpha > 1$, we have
\[ \frac{d}{dx}(\log x)^{1-\alpha} = (1-\alpha)\frac{1}{x\log^\alpha x}, \]
so that $\int_e^\infty \frac{dt}{t\log^\alpha t} = \frac{1}{\alpha-1}$. To see that the entropy is infinite, note that
\[ H(X) = A\sum_{k\ge 2} \frac{\log\frac{1}{A} + \log k + 2\log\log k}{k\log^2 k} \ge A\sum_{k\ge 2}\frac{\log k}{k\log^2 k} - C = A\sum_{k\ge 2}\frac{1}{k\log k} - C = \infty, \]
where $C < \infty$ bounds the (finitely many negative) remaining terms, and the final sum diverges by the integral comparison above with $\alpha = 1$. $\diamond$
KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function $f : \mathbb{R}^k \to \mathbb{R}$ as any
bowl-shaped function, that is, one satisfying
\[ f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) \tag{2.1.1} \]
for all $\lambda \in [0,1]$ and all $x, y$. The function $f$ is strictly convex if the convexity inequality (2.1.1) is
strict for $\lambda \in (0,1)$ and $x \ne y$. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let $f$ be convex. Then for any random variable $X$,
\[ f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]. \]
Moreover, if $f$ is strictly convex, then $f(\mathbb{E}[X]) < \mathbb{E}[f(X)]$ unless $X$ is constant.
Now we may define and provide a few properties of the KL-divergence. Let $P$ and $Q$ be
distributions defined on a discrete set $\mathcal{X}$. The KL-divergence between them is
\[ D_{\mathrm{kl}}(P\|Q) := \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}. \]
We observe immediately that $D_{\mathrm{kl}}(P\|Q) \ge 0$. To see this, we apply Jensen's inequality (Proposition 2.1.5) to the function $-\log$ and the random variable $q(X)/p(X)$, where $X$ is distributed according to $P$:
\[ D_{\mathrm{kl}}(P\|Q) = -\mathbb{E}\left[\log\frac{q(X)}{p(X)}\right] \ge -\log\mathbb{E}\left[\frac{q(X)}{p(X)}\right] = -\log\sum_x p(x)\frac{q(x)}{p(x)} = -\log(1) = 0. \]
Moreover, as $-\log$ is strictly convex, we have $D_{\mathrm{kl}}(P\|Q) > 0$ unless $P = Q$. Another consequence of
the nonnegativity of the KL-divergence is that whenever the set $\mathcal{X}$ is finite with cardinality $|\mathcal{X}| < \infty$,
any random variable $X$ supported on $\mathcal{X}$ has $H(X) \le \log|\mathcal{X}|$. Indeed, letting $m = |\mathcal{X}|$, $Q$
be the uniform distribution on $\mathcal{X}$ so that $q(x) = \frac{1}{m}$, and $X$ have distribution $P$ on $\mathcal{X}$, we have
\[ 0 \le D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -H(X) - \sum_x p(x)\log q(x) = -H(X) + \log m, \tag{2.1.2} \]
so that $H(X) \le \log m$. Thus, the uniform distribution has the highest entropy over all distributions
on the set $\mathcal{X}$.
Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables $X$ and $Y$. The mutual information $I(X;Y)$ between $X$ and
$Y$ is the KL-divergence between their joint distribution and the product of their marginal distributions.
More mathematically,
\[ I(X;Y) := \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}. \tag{2.1.3} \]
We can rewrite this in several ways. First, using Bayes' rule, we have $p(x,y)/p(y) = p(x\mid y)$, so
\begin{align*}
I(X;Y) &= \sum_{x,y} p(y)p(x\mid y)\log\frac{p(x\mid y)}{p(x)} \\
&= -\sum_x\sum_y p(y)p(x\mid y)\log p(x) + \sum_y p(y)\sum_x p(x\mid y)\log p(x\mid y) \\
&= H(X) - H(X\mid Y).
\end{align*}
Similarly, we have $I(X;Y) = H(Y) - H(Y\mid X)$, so mutual information can be thought of as the
amount of entropy removed (on average) from $X$ by observing $Y$. We may also think of mutual information as measuring the similarity between the joint distribution of $X$ and $Y$ and their distribution
when they are treated as independent.
Comparing the definition (2.1.3) to that for KL-divergence, we see that if $P_{XY}$ is the joint
distribution of $X$ and $Y$, while $P_X$ and $P_Y$ are their marginal distributions (the distributions of $X$
and $Y$ when they are treated as independent), then
\[ I(X;Y) = D_{\mathrm{kl}}\left(P_{XY} \| P_X \times P_Y\right). \]
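A quick numerical check of these identities—a sketch we add here, not part of the original text, assuming Python with numpy—on a small joint p.m.f.:

```python
import numpy as np

P = np.array([[0.2, 0.1, 0.1],
              [0.1, 0.2, 0.3]])            # joint pmf p(x, y) on a 2 x 3 alphabet
px, py = P.sum(axis=1), P.sum(axis=0)      # marginal pmfs

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

I_kl = np.sum(P * np.log(P / np.outer(px, py)))             # D_kl(P_XY || P_X x P_Y)
H_cond = sum(py[j] * H(P[:, j] / py[j]) for j in range(3))  # H(X | Y)
print(I_kl, H(px) - H_cond)                                 # agree up to rounding
```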
Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable $X$ with
density $p$ is defined by
\[ h(X) := -\int p(x)\log p(x)\, dx. \tag{2.1.4} \]
Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional differential entropy
\[ h(X\mid Y) = -\int p(y)\int p(x\mid y)\log p(x\mid y)\, dx\, dy. \]
We remark that the conditional differential entropy of $X$ given $Y$, for $Y$ with arbitrary distribution—so long as $X$ has a density—is
\[ h(X\mid Y) = \mathbb{E}\left[-\int p(x\mid Y)\log p(x\mid Y)\, dx\right], \]
where $p(x\mid y)$ denotes the conditional density of $X$ when $Y = y$. The KL-divergence between
distributions $P$ and $Q$ with densities $p$ and $q$ becomes
\[ D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx, \]
and similarly, we have the analogue of mutual information:
\[ I(X;Y) = \int p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\, dx\, dy = h(X) - h(X\mid Y) = h(Y) - h(Y\mid X). \]
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
a normal random variable is straightforward to compute. Indeed, for $X \sim \mathsf{N}(\mu, \sigma^2)$ we have
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{1}{2\sigma^2}(x-\mu)^2)$, so that
\[ h(X) = -\int p(x)\left[\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(x-\mu)^2\right] dx = \frac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}[(X-\mu)^2]}{2\sigma^2} = \frac{1}{2}\log(2\pi e\sigma^2). \]
For a general multivariate Gaussian, where $X \sim \mathsf{N}(\mu, \Sigma)$ for a vector $\mu \in \mathbb{R}^n$ and $\Sigma \succ 0$, with
density $p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}}\exp(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu))$, we similarly have
\[ h(X) = \frac{1}{2}\mathbb{E}\left[n\log(2\pi) + \log\det(\Sigma) + (X-\mu)^\top\Sigma^{-1}(X-\mu)\right] = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}\operatorname{tr}(\Sigma^{-1}\Sigma) = \frac{n}{2}\log(2\pi e) + \frac{1}{2}\log\det(\Sigma). \]
$\diamond$
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let $P$ be the multivariate
normal $\mathsf{N}(\mu_1, \Sigma)$, and $Q$ be the multivariate normal distribution with mean $\mu_2$ and identical
covariance $\Sigma \succ 0$. Then we have that
\[ D_{\mathrm{kl}}(P\|Q) = \frac{1}{2}(\mu_1 - \mu_2)^\top\Sigma^{-1}(\mu_1 - \mu_2). \tag{2.1.5} \]
We leave the computation of the identity (2.1.5) to the reader. $\diamond$
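The identity (2.1.5) is easy to check numerically; the following Monte Carlo sketch (ours, assuming Python with numpy) compares the closed form with an empirical average of the log-likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)            # a positive definite covariance

diff = mu1 - mu2
closed_form = 0.5 * diff @ np.linalg.solve(Sigma, diff)   # identity (2.1.5)

# Monte Carlo: D_kl(P||Q) = E_P[log p(X) - log q(X)]; the normalizing
# constants cancel since the covariances are identical.
X = rng.multivariate_normal(mu1, Sigma, size=200_000)
Sinv = np.linalg.inv(Sigma)
quad = lambda Z, m: np.einsum("ij,jk,ik->i", Z - m, Sinv, Z - m)
mc = 0.5 * np.mean(quad(X, mu2) - quad(X, mu1))
print(closed_form, mc)   # equal up to Monte Carlo error
```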
An interesting consequence of Example 2.1.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that $X$ has mean 0. Let $P$ be the distribution of
$X$ with density $p$, and let $Q$ be multivariate normal with mean 0 and covariance $\Sigma$, with density $q$; let $Z$ be this
random variable. Then
\[ D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\, dx = -h(X) + \int p(x)\left[\frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}x^\top\Sigma^{-1}x\right] dx = -h(X) + h(Z), \]
because $Z$ has the same covariance as $X$, so the bracketed terms have the same expectation under $P$ as under $\mathsf{N}(0,\Sigma)$. As $0 \le D_{\mathrm{kl}}(P\|Q)$, we have $h(Z) \ge h(X)$ as desired.
We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [143].
Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
\[ I(X; Y_1^n) = \sum_{i=1}^n I(X; Y_i \mid Y_1^{i-1}). \]
Indeed, we have
\[ I(X; Y_1^n) = H(Y_1^n) - H(Y_1^n\mid X) = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\right] = \sum_{i=1}^n I(X; Y_i\mid Y_1^{i-1}). \]
The KL-divergence obeys similar chain rules, making mutual information and KL-divergence useful
tools for evaluating distances and relationships between groups of random variables.
As a second example, suppose that $P = P_1\times P_2\times\cdots\times P_n$ and $Q = Q_1\times\cdots\times Q_n$,
that is, that $P$ and $Q$ are product distributions over independent random variables $X_i \sim P_i$ or
$X_i \sim Q_i$. Then we immediately have the tensorization identity
\[ D_{\mathrm{kl}}(P\|Q) = D_{\mathrm{kl}}\left(P_1\times\cdots\times P_n \| Q_1\times\cdots\times Q_n\right) = \sum_{i=1}^n D_{\mathrm{kl}}(P_i\|Q_i). \]
We remark in passing that these two identities hold for arbitrary distributions $P_i$ and $Q_i$ and random
variables $X, Y$. As a final tensorization identity, we consider a more general chain rule for KL-divergences, which will frequently be useful. We abuse notation temporarily, and for random
variables $X$ and $Y$ with distributions $P$ and $Q$, respectively, we denote $D_{\mathrm{kl}}(X\|Y) := D_{\mathrm{kl}}(P\|Q)$.
In analogy to the entropy, we can also define the conditional KL-divergence. Let $X$ and $Y$ have
distributions $P_{X|z}$ and $P_{Y|z}$ conditioned on $Z = z$, respectively. Then we define
\[ D_{\mathrm{kl}}(X\|Y\mid Z) = \mathbb{E}_Z\left[D_{\mathrm{kl}}\big(P_{X|Z}\|P_{Y|Z}\big)\right], \]
so that if $Z$ is discrete we have $D_{\mathrm{kl}}(X\|Y\mid Z) = \sum_z p(z)\, D_{\mathrm{kl}}\big(P_{X|z}\|P_{Y|z}\big)$. With this notation, we
have the chain rule
\[ D_{\mathrm{kl}}(X_1,\ldots,X_n\|Y_1,\ldots,Y_n) = \sum_{i=1}^n D_{\mathrm{kl}}\big(X_i\|Y_i\mid X_1^{i-1}\big), \tag{2.1.6} \]
because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions $P_{XY}$ and $Q_{XY}$ we have
\begin{align*}
D_{\mathrm{kl}}(P_{XY}\|Q_{XY}) &= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{q(x,y)} = \sum_{x,y} p(x)p(y\mid x)\left[\log\frac{p(y\mid x)}{q(y\mid x)} + \log\frac{p(x)}{q(x)}\right] \\
&= \sum_x p(x)\log\frac{p(x)}{q(x)} + \sum_x p(x)\sum_y p(y\mid x)\log\frac{p(y\mid x)}{q(y\mid x)},
\end{align*}
where the final equality uses that $\sum_y p(y\mid x) = 1$ for all $x$. In different notation, if we let $P$ and
$Q$ be any distributions on $\mathcal{X}_1\times\cdots\times\mathcal{X}_n$, and define $P_i(A\mid x_1^{i-1}) = P(X_i\in A\mid X_1^{i-1} = x_1^{i-1})$, and
similarly for $Q_i$, then
\[ D_{\mathrm{kl}}(P\|Q) = \sum_{i=1}^n \mathbb{E}_P\Big[D_{\mathrm{kl}}\big(P_i(\cdot\mid X_1^{i-1}) \,\|\, Q_i(\cdot\mid X_1^{i-1})\big)\Big], \]
the general form of the chain rule (2.1.6).
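A numerical sanity check of the product-distribution tensorization identity—a short sketch we add here, assuming Python with numpy:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
p1, q1 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
p2, q2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
P = np.outer(p1, p2).ravel()   # pmf of the product P_1 x P_2
Q = np.outer(q1, q2).ravel()   # pmf of the product Q_1 x Q_2
print(kl(P, Q), kl(p1, q1) + kl(p2, q2))  # tensorization: the two agree
```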
Expanding upon this, we give several tensorization identities, showing how to transform questions about the joint distribution of many random variables into simpler questions about their
marginals. As a first example, as a consequence of the fact that conditioning decreases entropy, we see that for any sequence of (discrete or continuous, as appropriate) random
variables, we have
\[ H(X_1,\ldots,X_n) \le \sum_{i=1}^n H(X_i) \quad\text{and}\quad h(X_1,\ldots,X_n) \le \sum_{i=1}^n h(X_i). \]
Both inequalities hold with equality if and only if $X_1,\ldots,X_n$ are mutually independent. (The “only
if” follows because $I(X;Y) > 0$ whenever $X$ and $Y$ are not independent, by Jensen's inequality and
the fact that $D_{\mathrm{kl}}(P\|Q) > 0$ unless $P = Q$.)
We return to information and divergence now. Suppose that the random variables $Y_i$ are independent conditional on $X$, meaning that
\[ P(Y_1\in A_1, \ldots, Y_n\in A_n\mid X) = \prod_{i=1}^n P(Y_i\in A_i\mid X) \quad\text{for all sets } A_1,\ldots,A_n. \]
Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some $X$. Then we have the inequality
\begin{align*}
I(X; Y_1,\ldots,Y_n) &= \sum_{i=1}^n\big[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\big] \\
&= \sum_{i=1}^n\big[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X)\big] \le \sum_{i=1}^n\big[H(Y_i) - H(Y_i\mid X)\big] = \sum_{i=1}^n I(X; Y_i), \tag{2.1.7}
\end{align*}
where the second equality uses the assumed conditional independence of the $Y_i$ given $X$, and the
inequality follows because conditioning reduces entropy.
A related key property of mutual information is the data processing inequality. Suppose that the
random variables $X, Y, Z$ form the Markov chain
\[ X \to Y \to Z, \]
so that $X$ and $Z$ are conditionally independent given $Y$.
Proposition 2.1.10. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).
Proof We expand the mutual information $I(X; Y, Z)$ in two ways via the chain rule:
\[ I(X; Y, Z) = I(X; Z) + I(X; Y\mid Z) = I(X; Y) + I(X; Z\mid Y), \]
where we note that $I(X; Z\mid Y) = 0$, because $X$ is independent of $Z$ given $Y$. As $I(X; Y\mid Z) \ge 0$, we conclude that $I(X;Z) \le I(X;Y)$.
There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain $X \to Z$. If we
let $P_1$ and $P_2$ be distributions on $X$ and $Q_1$ and $Q_2$ be the induced distributions on $Z$, that is,
$Q_i(A) = \int P(Z\in A\mid x)\, dP_i(x)$, then we have
\[ D_{\mathrm{kl}}(Q_1\|Q_2) \le D_{\mathrm{kl}}(P_1\|P_2), \]
the basic KL-divergence data processing inequality. A consequence of this is that, for any function
$f$ and random variables $X$ and $Y$ on the same space, we have
\[ D_{\mathrm{kl}}(f(X)\|f(Y)) \le D_{\mathrm{kl}}(X\|Y). \]
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
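The contraction is easy to observe numerically; in this sketch (ours, assuming numpy) a random Markov kernel never increases the KL-divergence:

```python
import numpy as np

rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(6))            # distribution P_1 of X
q = rng.dirichlet(np.ones(6))            # distribution P_2 of X
K = rng.dirichlet(np.ones(4), size=6)    # K[x] = distribution of Z given X = x
Kp, Kq = p @ K, q @ K                    # induced distributions Q_1, Q_2 of Z

kl = lambda a, b: np.sum(a * np.log(a / b))
print(kl(p, q), kl(Kp, Kq))  # the processed pair is never farther apart
```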
Recall that a quantizer of a set $\mathcal{X}$ is any map $\mathsf{q} : \mathcal{X} \to \{1,\ldots,m\}$ for some finite $m$, and that a collection $\mathcal{A}$ of subsets of $\mathcal{X}$ is a finite algebra if it satisfies the following two properties:
1. The set $\mathcal{X} \in \mathcal{A}$.
2. The collection of sets $\mathcal{A}$ is closed under finite set operations: union, intersection, and complementation. That is, $A, B \in \mathcal{A}$ implies that $A^c \in \mathcal{A}$, $A \cap B \in \mathcal{A}$, and $A \cup B \in \mathcal{A}$.
There is a 1-to-1 correspondence between quantizers—and their associated partitions of the set
X —and finite algebras on a set X , which we discuss briefly.1 It should be clear that there is a
one-to-one correspondence between finite partitions of the set X and quantizers q, so we must argue
that finite partitions of X are in one-to-one correspondence with finite algebras defined over X .
¹Pedantically, this one-to-one correspondence holds up to permutations of the partition induced by the quantizer.
In one direction, we may consider a quantizer q : X → {1, . . . , m}. Let the sets A1 , . . . , Am
be the partition associated with q, that is, for x ∈ Ai we have q(x) = i, or Ai = q−1 ({i}). Then
we may define an algebra Aq as the collection of all finite set operations performed on A1 , . . . , Am
(note that this is a finite collection, as finite set operations performed on the partition A1 , . . . , Am
induce only a finite collection of sets).
For the other direction, consider a finite algebra A over the set X . We can then construct a
quantizer qA that corresponds to this algebra. To do so, we define an atom of A as any non-empty
set A ∈ A such that if B ⊂ A and B ∈ A, then B = A or B = ∅. That is, the atoms of A are the
“smallest” sets in $\mathcal{A}$. There is a unique partition $A_1,\ldots,A_m$ of $\mathcal{X}$ into atoms of $\mathcal{A}$ (one may
prove this by induction on the size of $\mathcal{A}$), and the quantizer associated with $\mathcal{A}$ is then defined by
\[ \mathsf{q}_{\mathcal{A}}(x) = i \text{ when } x \in A_i. \]
2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
$Q$ conditioned on $\mathcal{A}$ is
\[ D_{\mathrm{kl}}(P\|Q\mid\mathcal{A}) := \sum_{A\in\operatorname{atoms}(\mathcal{A})} P(A)\log\frac{P(A)}{Q(A)}. \]
That is, we simply sum over the partition of $\mathcal{X}$. Another way to write this is as follows. Let
$\mathsf{q} : \mathcal{X} \to \{1,\ldots,m\}$ be a quantizer, and define the sets $A_i = \mathsf{q}^{-1}(\{i\})$ to be the pre-images of each
$i$ (i.e., the different quantization regions, or the partition of $\mathcal{X}$ that $\mathsf{q}$ induces). Then the quantized
KL-divergence between $P$ and $Q$ is
\[ D_{\mathrm{kl}}(P\|Q\mid\mathsf{q}) := \sum_{i=1}^m P(A_i)\log\frac{P(A_i)}{Q(A_i)}. \]
We may now give the fully general definition of KL-divergence: the KL-divergence between $P$
and $Q$ is defined as
\[ D_{\mathrm{kl}}(P\|Q) := \sup\big\{D_{\mathrm{kl}}(P\|Q\mid\mathsf{q}) \mid \mathsf{q} \text{ is a quantizer of } \mathcal{X}\big\}, \tag{2.2.1} \]
the supremum taken over all finite quantizers (equivalently, all finite algebras) on $\mathcal{X}$.
This also gives a rigorous definition of mutual information. Indeed, if $X$ and $Y$ are random variables
with joint distribution $P_{XY}$ and marginal distributions $P_X$ and $P_Y$, we simply define
\[ I(X;Y) = D_{\mathrm{kl}}\left(P_{XY}\|P_X\times P_Y\right). \]
When $P$ and $Q$ have densities $p$ and $q$, the definition (2.2.1) reduces to
\[ D_{\mathrm{kl}}(P\|Q) = \int_{\mathbb{R}} p(x)\log\frac{p(x)}{q(x)}\, dx, \]
while if $P$ and $Q$ both have probability mass functions $p$ and $q$, then—as we see in Exercise 2.6—the
definition (2.2.1) is equivalent to
\[ D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}, \]
precisely the discrete definition with which we began.
Measure-theoretic definition of KL-divergence If you have never seen measure theory before, skim this section; while the notation may be somewhat intimidating, it is fine to always
consider only continuous or fully discrete distributions. We will describe an interpretation that will
mean, for our purposes, that one never needs to really think about measure-theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let $\mu$ be a measure
on $\mathcal{X}$, and assume that $P$ and $Q$ are absolutely continuous with respect to $\mu$, with densities $p$ and
$q$, respectively. (For example, take $\mu = P + Q$.) Then
\[ D_{\mathrm{kl}}(P\|Q) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\, d\mu(x). \tag{2.2.2} \]
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [104, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be
as follows. When integrating a function $f(x)$, replace $\int f(x)\, d\mu(x)$ with one of two interpretations:
one may simply think of $d\mu(x)$ as $dx$, so that we are performing standard integration $\int f(x)\, dx$; or
one should think of the integral operation $\int f(x)\, d\mu(x)$ as summing the argument of the integral, so
that $\int_{\{x\}} d\mu = 1$ for each point $x$ and $\int f(x)\, d\mu(x) = \sum_x f(x)$. (This corresponds to $\mu$ being “counting measure” on $\mathcal{X}$.)
2.2.3 f-divergences
A more general notion of divergence is the so-called $f$-divergence, or Ali–Silvey divergence [6, 59]
(see also the alternate interpretations in the article by Liese and Vajda [137]). Here, the definition
is as follows. Let $P$ and $Q$ be probability distributions on the set $\mathcal{X}$, and let $f : \mathbb{R}_+ \to \mathbb{R}$ be a
convex function satisfying $f(1) = 0$. If $\mathcal{X}$ is a discrete set, then the $f$-divergence between $P$ and $Q$
is
\[ D_f(P\|Q) := \sum_x q(x) f\left(\frac{p(x)}{q(x)}\right). \]
More generally, for any set $\mathcal{X}$ and a quantizer $\mathsf{q} : \mathcal{X} \to \{1,\ldots,m\}$, letting $A_i = \mathsf{q}^{-1}(\{i\}) = \{x\in\mathcal{X}\mid\mathsf{q}(x) = i\}$ be the partition the quantizer induces, we can define the quantized divergence
\[ D_f(P\|Q\mid\mathsf{q}) = \sum_{i=1}^m Q(A_i) f\left(\frac{P(A_i)}{Q(A_i)}\right), \]
and the general definition of an $f$-divergence is (in analogy with the definition (2.2.1) of general
KL-divergences)
\[ D_f(P\|Q) := \sup\big\{D_f(P\|Q\mid\mathsf{q}) \mid \mathsf{q} \text{ is a quantizer of } \mathcal{X}\big\}. \tag{2.2.3} \]
The definition (2.2.3) shows that, any time we have computations involving $f$-divergences—such
as KL-divergence or mutual information—it is no loss of generality, when performing the computations, to assume that all distributions have finite discrete support. There is a measure-theoretic
version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that $P$ and $Q$ are
absolutely continuous with respect to the base measure $\mu$. The $f$-divergence between $P$ and $Q$ is
then
\[ D_f(P\|Q) := \int_{\mathcal{X}} q(x) f\left(\frac{p(x)}{q(x)}\right) d\mu(x). \tag{2.2.4} \]
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integrand at points $x$ where $q(x) = 0$. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function $f : \mathbb{R} \to \mathbb{R}$ is defined
by $\operatorname{pers}(f)(t,u) = uf(t/u)$ if $u > 0$ and by $+\infty$ if $u \le 0$. This function is jointly convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of
a convex function $f$ is $\operatorname{cl} f(x) = \sup\{\ell(x) \mid \ell \le f,\ \ell \text{ affine}\}$, the supremum over all affine functions
that globally lower bound $f$. Then [111, Proposition IV.2.2.2] the closure of $\operatorname{pers}(f)$ is given, for
any $t' \in \operatorname{int}\operatorname{dom} f$, by
\[ \operatorname{cl}\operatorname{pers}(f)(t,u) = \begin{cases} uf(t/u) & \text{if } u > 0 \\ \lim_{\alpha\downarrow 0} \alpha f(t' + t/\alpha) & \text{if } u = 0 \\ +\infty & \text{if } u < 0, \end{cases} \]
the middle case being the recession function of $f$ in the direction $t$. (The choice of $t'$ does not affect the definition.) Then the fully general formula expressing the
$f$-divergence is
\[ D_f(P\|Q) = \int_{\mathcal{X}} \operatorname{cl}\operatorname{pers}(f)(p(x), q(x))\, d\mu(x). \tag{2.2.5} \]
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f -divergences from additional perspectives later—they are impor-
tant both for optimality in estimation and for consistency in prediction problems, as we
discuss in Section 16.5.
Examples We give several examples of f -divergences here; in Section 9.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.
Example 2.2.1 (KL-divergence): By taking $f(t) = t\log t$, which is convex and satisfies
$f(1) = 0$, we obtain $D_f(P\|Q) = D_{\mathrm{kl}}(P\|Q)$. $\diamond$
Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions $P$ and $Q$ defined on a set $\mathcal{X}$ is the maximum difference between the probabilities they
assign to subsets of $\mathcal{X}$:
\[ \|P - Q\|_{\mathrm{TV}} := \sup_{A\subset\mathcal{X}} |P(A) - Q(A)| = \sup_{A\subset\mathcal{X}} (P(A) - Q(A)), \tag{2.2.6} \]
where the second equality follows by considering complements, as $P(A^c) = 1 - P(A)$. The total
variation distance, as we shall see later, is important for verifying the optimality of different
tests, and appears in the measurement of the difficulty of solving hypothesis testing problems.
Choosing $f(t) = \frac{1}{2}|t-1|$ yields the total variation distance, that is, $\|P - Q\|_{\mathrm{TV}} = D_f(P\|Q)$.
There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will
be useful in the sequel when we develop inequalities relating the divergences. $\diamond$
Lemma 2.2.4. Let $P, Q$ be probability measures with densities $p, q$ with respect to a base measure
$\mu$ and $f(t) = \frac{1}{2}|t-1|$. Then
\begin{align*}
\|P - Q\|_{\mathrm{TV}} = D_f(P\|Q) &= \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \\
&= \int [p(x) - q(x)]_+\, d\mu(x) = \int [q(x) - p(x)]_+\, d\mu(x).
\end{align*}
Considering the last integral $\int [q(x) - p(x)]_+\, d\mu(x)$, we see that the set $A = \{x : q(x) > p(x)\}$
satisfies
\[ Q(A) - P(A) = \int_A (q(x) - p(x))\, d\mu(x) \ge \int_B (q(x) - p(x))\, d\mu(x) = Q(B) - P(B) \]
for any set $B \subset \mathcal{X}$, so that this set attains the supremum (2.2.6).
Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions $P$ and $Q$ defined on a set $\mathcal{X}$ is generated by the function $f(t) = \frac{1}{2}(\sqrt{t} - 1)^2 = \frac{1}{2}(t - 2\sqrt{t} + 1)$.
The squared Hellinger distance is then
\[ d_{\mathrm{hel}}^2(P, Q) := \frac{1}{2}\int\big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\, d\mu(x). \tag{2.2.7} \]
The non-squared version $d_{\mathrm{hel}}(P, Q)$ is indeed a distance between probability measures $P$ and
$Q$. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between
$P$ and $Q$, as
\[ d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2}\int\big(p(x) + q(x) - 2\sqrt{p(x)q(x)}\big)\, d\mu(x) = 1 - \int\sqrt{p(x)q(x)}\, d\mu(x), \tag{2.2.8} \]
which makes clear that $d_{\mathrm{hel}}(P, Q) \in [0, 1]$ is on roughly the same scale as the variation distance;
we will say more later. $\diamond$
Example 2.2.6 ($\chi^2$-divergence): The $\chi^2$-divergence is generated by taking $f(t) = (t-1)^2$,
so that
\[ D_{\chi^2}(P\|Q) := \int\left(\frac{p(x)}{q(x)} - 1\right)^2 q(x)\, d\mu(x) = \int\frac{p(x)^2}{q(x)}\, d\mu(x) - 1, \tag{2.2.9} \]
where the second equality is immediate because $\int p\, d\mu = \int q\, d\mu = 1$. $\diamond$
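Each of the examples above fits the single template $D_f(P\|Q) = \sum_x q(x) f(p(x)/q(x))$ in the discrete case; the following sketch (ours, not from the text, assuming Python with numpy) computes them for a pair of discrete distributions and checks the alternative forms:

```python
import numpy as np

def f_divergence(p, q, f):
    return np.sum(q * f(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

tv   = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))
hel2 = f_divergence(p, q, lambda t: 0.5 * (np.sqrt(t) - 1) ** 2)  # squared Hellinger
kl   = f_divergence(p, q, lambda t: t * np.log(t))
chi2 = f_divergence(p, q, lambda t: (t - 1) ** 2)

print(tv, 0.5 * np.abs(p - q).sum())       # matches Lemma 2.2.4
print(hel2, 1 - np.sum(np.sqrt(p * q)))    # matches the affinity form (2.2.8)
print(kl, chi2)
```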
Proposition 2.2.7. The total variation distance and Hellinger distance satisfy
\[ d_{\mathrm{hel}}^2(P, Q) \le \|P - Q\|_{\mathrm{TV}} \le d_{\mathrm{hel}}(P, Q)\sqrt{2 - d_{\mathrm{hel}}^2(P, Q)}. \]
Proof We begin with the upper bound. We have by Hölder's inequality that
\begin{align*}
\frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) &= \frac{1}{2}\int \big|\sqrt{p(x)} - \sqrt{q(x)}\big|\cdot\big|\sqrt{p(x)} + \sqrt{q(x)}\big|\, d\mu(x) \\
&\le \left(\frac{1}{2}\int\big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2 d\mu(x)\right)^{1/2}\left(\frac{1}{2}\int\big(\sqrt{p(x)} + \sqrt{q(x)}\big)^2 d\mu(x)\right)^{1/2} \\
&= d_{\mathrm{hel}}(P, Q)\left(1 + \int\sqrt{p(x)q(x)}\, d\mu(x)\right)^{1/2}.
\end{align*}
As in Example 2.2.5, we have $\int\sqrt{p(x)q(x)}\, d\mu(x) = 1 - d_{\mathrm{hel}}^2(P, Q)$, so this (along with the representation of Lemma 2.2.4 for the variation distance) implies
\[ \|P - Q\|_{\mathrm{TV}} = \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \le d_{\mathrm{hel}}(P, Q)\big(2 - d_{\mathrm{hel}}^2(P, Q)\big)^{1/2}. \]
For the lower bound on the total variation distance, note that for any $a, b \in \mathbb{R}_+$, we have $a + b - 2\sqrt{ab} \le |a - b|$
(check the cases $a > b$ and $a < b$ separately); thus
\[ d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2}\int\left[p(x) + q(x) - 2\sqrt{p(x)q(x)}\right] d\mu(x) \le \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x), \]
as desired.
Several important inequalities relate the variation distance to the KL-divergence. We state
two in the next proposition, both of which are important enough to justify their own names.
Proposition 2.2.8. The total variation distance satisfies the following relationships.
(a) Pinsker’s inequality: for any distributions $P$ and $Q$,
\[ \|P - Q\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P\|Q). \tag{2.2.10} \]
(b) The Bretagnolle–Huber inequality: for any distributions $P$ and $Q$,
\[ \|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - \exp(-D_{\mathrm{kl}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\mathrm{kl}}(P\|Q)). \]
Proof Exercise 2.19 outlines one proof of Pinsker's inequality using the data processing inequality
(Proposition 2.2.13). We present an alternative via the Cauchy–Schwarz inequality. Using the
definition (2.2.1) of the KL-divergence, we may assume without loss of generality that $P$ and $Q$ are
finitely supported, say with p.m.f.s $p_1,\ldots,p_m$ and $q_1,\ldots,q_m$. Define the negative entropy function
$h(p) = \sum_{i=1}^m p_i\log p_i$. Then showing that $D_{\mathrm{kl}}(P\|Q) \ge 2\|P - Q\|_{\mathrm{TV}}^2 = \frac{1}{2}\|p - q\|_1^2$ is equivalent to
showing that
\[ h(p) \ge h(q) + \langle\nabla h(q), p - q\rangle + \frac{1}{2}\|p - q\|_1^2, \tag{2.2.11} \]
because by inspection $h(p) - h(q) - \langle\nabla h(q), p - q\rangle = \sum_i p_i\log\frac{p_i}{q_i}$. We do this via a Taylor expansion:
we have
\[ \nabla h(p) = [\log p_i + 1]_{i=1}^m \quad\text{and}\quad \nabla^2 h(p) = \operatorname{diag}\big([1/p_i]_{i=1}^m\big). \]
By Taylor's theorem, there is some $\tilde{p} = (1-t)p + tq$, where $t \in [0,1]$, such that
\[ h(p) = h(q) + \langle\nabla h(q), p - q\rangle + \frac{1}{2}\langle p - q, \nabla^2 h(\tilde{p})(p - q)\rangle. \]
But looking at the final quadratic, we have for any vector $v$ and any $\tilde{p} \ge 0$ satisfying $\sum_i \tilde{p}_i = 1$,
\[ \langle v, \nabla^2 h(\tilde{p}) v\rangle = \sum_{i=1}^m \frac{v_i^2}{\tilde{p}_i} = \|\tilde{p}\|_1 \sum_{i=1}^m \frac{v_i^2}{\tilde{p}_i} \ge \left(\sum_{i=1}^m \sqrt{\tilde{p}_i}\,\frac{|v_i|}{\sqrt{\tilde{p}_i}}\right)^2 = \|v\|_1^2, \]
where the inequality follows from Cauchy–Schwarz applied to the vectors $[\sqrt{\tilde{p}_i}]_i$ and $[|v_i|/\sqrt{\tilde{p}_i}]_i$.
Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let $a = \int\sqrt{p(x)q(x)}\, d\mu(x)$ be a shorthand for the
affinity, so that $d_{\mathrm{hel}}^2(P, Q) = 1 - a$. Then Proposition 2.2.7 gives $\|P - Q\|_{\mathrm{TV}} \le \sqrt{1-a}\sqrt{1+a} = \sqrt{1 - a^2}$. Now apply Jensen's inequality to the exponential: we have
\begin{align*}
\int\sqrt{p(x)q(x)}\, d\mu(x) &= \int\sqrt{\frac{q(x)}{p(x)}}\, p(x)\, d\mu(x) = \int\exp\left(\frac{1}{2}\log\frac{q(x)}{p(x)}\right) p(x)\, d\mu(x) \\
&\ge \exp\left(\frac{1}{2}\int p(x)\log\frac{q(x)}{p(x)}\, d\mu(x)\right) = \exp\left(-\frac{1}{2} D_{\mathrm{kl}}(P\|Q)\right).
\end{align*}
In particular, $\sqrt{1 - a^2} \le \sqrt{1 - \exp(-\frac{1}{2}D_{\mathrm{kl}}(P\|Q))^2} = \sqrt{1 - \exp(-D_{\mathrm{kl}}(P\|Q))}$, which is the first claim of part (b). For the
second, note that $\sqrt{1-c} \le 1 - \frac{1}{2}c$ for $c \in [0,1]$ by concavity of the square root.
We also have the following bounds on the Hellinger distance in terms of the KL-divergence, and
on the KL-divergence in terms of the $\chi^2$-divergence.
Proposition 2.2.9. For any distributions $P$ and $Q$,
\[ 2 d_{\mathrm{hel}}^2(P, Q) \le D_{\mathrm{kl}}(P\|Q) \le \log\big(1 + D_{\chi^2}(P\|Q)\big) \le D_{\chi^2}(P\|Q). \]
Proof For the first inequality, note that $\log x \le x - 1$ by concavity, or $1 - x \le -\log x$, so that
\[ 2 d_{\mathrm{hel}}^2(P, Q) = 2 - 2\int\sqrt{p(x)q(x)}\, d\mu(x) = 2\int p(x)\left(1 - \sqrt{\frac{q(x)}{p(x)}}\right) d\mu(x) \le 2\int p(x)\log\sqrt{\frac{p(x)}{q(x)}}\, d\mu(x) = D_{\mathrm{kl}}(P\|Q). \]
For the middle inequality, Jensen's inequality applied to the concave logarithm gives
\[ D_{\mathrm{kl}}(P\|Q) = \int\log\frac{dP}{dQ}\, dP \le \log\int\frac{dP}{dQ}\, dP = \log\big(1 + D_{\chi^2}(P\|Q)\big). \]
The last inequality is immediate as $\log(1+t) \le t$ for all $t > -1$.
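The inequalities of Propositions 2.2.7–2.2.9 are easy to stress-test numerically; this sketch (ours, not from the text, assuming Python with numpy) checks them on random p.m.f.s:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    hel2 = 1 - np.sum(np.sqrt(p * q))       # squared Hellinger distance
    kl = np.sum(p * np.log(p / q))
    chi2 = np.sum(p ** 2 / q) - 1
    assert hel2 <= tv + 1e-12 and tv <= np.sqrt(hel2 * (2 - hel2)) + 1e-12  # Prop. 2.2.7
    assert tv ** 2 <= kl / 2 + 1e-12                                        # Pinsker (2.2.10)
    assert 2 * hel2 <= kl + 1e-12 and kl <= np.log1p(chi2) + 1e-12          # Prop. 2.2.9
print("all inequalities hold on 1000 random pairs")
```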
It is also possible to relate the mutual information to $f$-divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In
this case, we consider the following situation: let $V \in \{0,1\}$ be chosen uniformly at random, and conditional
on $V = v$, draw $X \sim P_v$ for some distribution $P_v$ on a space $\mathcal{X}$. Then we have that
\[ I(X; V) = \frac{1}{2} D_{\mathrm{kl}}\big(P_0\|\bar{P}\big) + \frac{1}{2} D_{\mathrm{kl}}\big(P_1\|\bar{P}\big), \quad\text{where } \bar{P} = \frac{1}{2}P_0 + \frac{1}{2}P_1. \]
The divergence measure on the right side of the preceding identity is a
special case of the Jensen–Shannon divergence, defined for $\lambda \in [0,1]$ by
\[ D_{\mathrm{js},\lambda}(P\|Q) := \lambda D_{\mathrm{kl}}\big(P\|\lambda P + (1-\lambda)Q\big) + (1-\lambda) D_{\mathrm{kl}}\big(Q\|\lambda P + (1-\lambda)Q\big), \tag{2.2.12} \]
which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand
$D_{\mathrm{js}}(P\|Q) := D_{\mathrm{js},\frac{1}{2}}(P\|Q)$ for the symmetric case). As a consequence, we also have
\[ I(X; V) = \frac{1}{2} D_f(P_0\|P_1) + \frac{1}{2} D_f(P_1\|P_0), \quad\text{where } f(t) = -t\log\left(\frac{1}{2t} + \frac{1}{2}\right) = t\log\frac{2t}{t+1}, \]
so that the mutual information is a particular $f$-divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with tensorization properties similar to those of the KL-divergence while enjoying the boundedness properties
of the Hellinger and variation distances. The following proposition captures the latter properties.
Proposition 2.2.10. Let $(X, V)$ be distributed as above. Then
\[ \log 2\cdot d_{\mathrm{hel}}^2(P_0, P_1) \le I(X; V) = D_{\mathrm{js}}(P_0\|P_1) \le \min\left\{\log 2\cdot\|P_0 - P_1\|_{\mathrm{TV}},\; 2\, d_{\mathrm{hel}}^2(P_0, P_1)\right\}. \]
Proof The lower bound and the upper bound involving the variation distance both follow from
analytic bounds on the binary entropy functional $h_2(p) = -p\log p - (1-p)\log(1-p)$. By expanding
the mutual information and letting $p_0$ and $p_1$ be densities of $P_0$ and $P_1$ with respect to some base
measure $\mu$, we have
\begin{align*}
2 I(X; V) = 2 D_{\mathrm{js}}(P_0\|P_1) &= \int p_0\log\frac{2p_0}{p_0 + p_1}\, d\mu + \int p_1\log\frac{2p_1}{p_0 + p_1}\, d\mu \\
&= 2\log 2 + \int (p_0 + p_1)\left[\frac{p_0}{p_0 + p_1}\log\frac{p_0}{p_0 + p_1} + \frac{p_1}{p_0 + p_1}\log\frac{p_1}{p_0 + p_1}\right] d\mu \\
&= 2\log 2 - \int (p_0 + p_1)\, h_2\!\left(\frac{p_0}{p_0 + p_1}\right) d\mu.
\end{align*}
We claim that
\[ 2\log 2\cdot\min\{p, 1-p\} \le h_2(p) \le 2\log 2\cdot\sqrt{p(1-p)} \]
for all $p \in [0,1]$ (see Exercises 2.17 and 2.18). Then the upper and lower bounds on the information
become nearly immediate.
For the variation-based upper bound on $I(X; V)$, we use the lower bound $h_2(p) \ge 2\log 2\cdot\min\{p, 1-p\}$ to write
\begin{align*}
\frac{2}{\log 2} I(X; V) &\le 2 - 2\int (p_0(x) + p_1(x))\min\left\{\frac{p_0(x)}{p_0(x) + p_1(x)}, \frac{p_1(x)}{p_0(x) + p_1(x)}\right\} d\mu(x) \\
&= 2 - 2\int\min\{p_0(x), p_1(x)\}\, d\mu(x) \\
&= 2\int\big(p_1(x) - \min\{p_0(x), p_1(x)\}\big)\, d\mu(x) = 2\int_{p_1 > p_0}(p_1(x) - p_0(x))\, d\mu(x).
\end{align*}
But of course the final integral is $\|P_1 - P_0\|_{\mathrm{TV}}$, giving $I(X; V) \le \log 2\cdot\|P_0 - P_1\|_{\mathrm{TV}}$. Conversely,
for the lower bound on $D_{\mathrm{js}}(P_0\|P_1)$, we use the upper bound $h_2(p) \le 2\log 2\cdot\sqrt{p(1-p)}$ to obtain
\[ \frac{1}{\log 2} I(X; V) \ge 1 - \int (p_0 + p_1)\sqrt{\frac{p_0}{p_0 + p_1}\left(1 - \frac{p_0}{p_0 + p_1}\right)}\, d\mu = 1 - \int\sqrt{p_0 p_1}\, d\mu = \frac{1}{2}\int\big(\sqrt{p_0} - \sqrt{p_1}\big)^2 d\mu = d_{\mathrm{hel}}^2(P_0, P_1), \]
as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have
\begin{align*}
D_{\mathrm{js}}(P_0\|P_1) &= \frac{1}{2} D_{\mathrm{kl}}\big(P_0\|(P_0 + P_1)/2\big) + \frac{1}{2} D_{\mathrm{kl}}\big(P_1\|(P_0 + P_1)/2\big) \\
&\le \frac{1}{2} D_{\chi^2}\big(P_0\|(P_0 + P_1)/2\big) + \frac{1}{2} D_{\chi^2}\big(P_1\|(P_0 + P_1)/2\big) \\
&= \frac{1}{2}\int\frac{(p_0 - p_1)^2}{p_0 + p_1}\, d\mu = \frac{1}{2}\int\frac{(\sqrt{p_0} - \sqrt{p_1})^2(\sqrt{p_0} + \sqrt{p_1})^2}{p_0 + p_1}\, d\mu.
\end{align*}
Now note that $(a + b)^2 \le 2a^2 + 2b^2$ for any $a, b \in \mathbb{R}$, and so $(\sqrt{p_0} + \sqrt{p_1})^2 \le 2(p_0 + p_1)$, and thus
the final integral has bound $\int(\sqrt{p_0} - \sqrt{p_1})^2\, d\mu = 2 d_{\mathrm{hel}}^2(P_0, P_1)$.
Proposition 2.2.11 (Joint convexity of $f$-divergences). Let $f$ be convex with $f(1) = 0$, let $P_1, P_2, Q_1, Q_2$ be probability distributions, and let $\lambda \in [0,1]$. Then
\[ D_f\big(\lambda P_1 + (1-\lambda)P_2 \,\|\, \lambda Q_1 + (1-\lambda)Q_2\big) \le \lambda D_f(P_1\|Q_1) + (1-\lambda) D_f(P_2\|Q_2). \]
We leave the proof of this proposition as Exercise 2.11, treating it as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.
Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.
We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = ϕ(X) for some function ϕ, and more generally when Z = ϕ(X, U ) for a function ϕ and
some additional randomness U . For a distribution P on X, we then define the marginals
\[ KP(A) := \int_{\mathcal{X}} K(A \mid x)\, dP(x). \]
Proposition 2.2.13 (Data processing). Let $P$ and $Q$ be distributions on $\mathcal{X}$ and let $K$ be any Markov kernel. Then
\[ D_f(KP\|KQ) \le D_f(P\|Q). \]
Thus, further processing of random variables can only bring them “closer” in the space of distributions; downstream processing of signals cannot make them further apart as distributions.
We can give an exact expression for the minimal possible error in the above hypothesis test.
Indeed, a standard result of Le Cam (see [134, 194, Lemma 1]) is the following variational representation of the total variation distance (2.2.6), which is the $f$-divergence associated with $f(t) = \frac{1}{2}|t-1|$,
as a function of testing error.
Proposition 2.3.1. Let $\mathcal{X}$ be an arbitrary set. For any distributions $P_1$ and $P_2$ on $\mathcal{X}$, we have
\[ \inf_{\Psi}\left\{P_1(\Psi \ne 1) + P_2(\Psi \ne 2)\right\} = 1 - \|P_1 - P_2\|_{\mathrm{TV}}, \]
the infimum taken over all tests $\Psi : \mathcal{X} \to \{1, 2\}$.
Proof Any test $\Psi : \mathcal{X} \to \{1, 2\}$ has an acceptance region, call it $A \subset \mathcal{X}$, where it outputs 1, and
a region $A^c$ where it outputs 2. Thus
\[ \inf_{\Psi}\{P_1(\Psi \ne 1) + P_2(\Psi \ne 2)\} = \inf_{A\subset\mathcal{X}}\{1 - (P_1(A) - P_2(A))\} = 1 - \sup_{A\subset\mathcal{X}}(P_1(A) - P_2(A)), \]
which is $1 - \|P_1 - P_2\|_{\mathrm{TV}}$ by the definition (2.2.6).
In the two-hypothesis case, we also know that the optimal test, by the Neyman–Pearson lemma,
is a likelihood ratio test. That is, assuming that $P_1$ and $P_2$ have densities $p_1$ and $p_2$, the optimal
test is of the form
\[ \Psi(X) = \begin{cases} 1 & \text{if } \frac{p_1(X)}{p_2(X)} \ge t \\ 2 & \text{if } \frac{p_1(X)}{p_2(X)} < t \end{cases} \]
for some threshold $t \ge 0$. In the case that the prior probabilities on $P_1$ and $P_2$ are each $\frac{1}{2}$, then
$t = 1$ is optimal.
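One can verify Proposition 2.3.1 by brute force on a small space; in this sketch (ours, not from the text, assuming Python), we minimize the summed error over all $2^4$ acceptance regions:

```python
import itertools
import numpy as np

p1 = np.array([0.4, 0.3, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.3, 0.4])

best = min(
    (1 - p1[list(A)].sum()) + p2[list(A)].sum()   # P1(Psi != 1) + P2(Psi != 2)
    for r in range(5)
    for A in itertools.combinations(range(4), r)  # acceptance region for "1"
)
tv = 0.5 * np.abs(p1 - p2).sum()
print(best, 1 - tv)   # both equal 0.6 here
```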
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.
Example 2.3.2 (Testing a normal mean): Suppose we observe $X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P$ for $P = P_1$
or $P = P_2$, where $P_v$ is the normal distribution $\mathsf{N}(\mu_v, \sigma^2)$ and $\mu_1 \ne \mu_2$. We would like to
understand the sample size $n$ necessary to guarantee that no test can have small error, that
is, say, that
\[ \inf_{\Psi}\left\{P_1(\Psi(X_1,\ldots,X_n) \ne 1) + P_2(\Psi(X_1,\ldots,X_n) \ne 2)\right\} \ge \frac{1}{2}. \]
By Proposition 2.3.1, we have that
\[ \inf_{\Psi}\left\{P_1^n(\Psi \ne 1) + P_2^n(\Psi \ne 2)\right\} = 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}}, \]
where $P_v^n$ denotes the $n$-fold product of $P_v$, that is, the distribution of $X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P_v$.
The interaction between total variation distance and product distributions is somewhat
subtle, so it is often advisable to use a divergence measure more attuned to the i.i.d. nature
of the sampling scheme. Two such measures are the KL-divergence and Hellinger distance,
both of which we explore in the coming chapters. With that in mind, we apply Pinsker's
inequality (2.2.10) to see that $\|P_1^n - P_2^n\|_{\mathrm{TV}}^2 \le \frac{1}{2} D_{\mathrm{kl}}(P_1^n\|P_2^n) = \frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)$, which implies
that
\[ 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}} \ge 1 - \sqrt{\frac{n}{2} D_{\mathrm{kl}}(P_1\|P_2)} = 1 - \sqrt{\frac{n}{2}\cdot\frac{1}{2\sigma^2}(\mu_1 - \mu_2)^2} = 1 - \frac{\sqrt{n}\,|\mu_1 - \mu_2|}{2\sigma}. \]
In particular, if $n \le \frac{\sigma^2}{(\mu_1 - \mu_2)^2}$, then we have our desired lower bound of $\frac{1}{2}$.
Conversely, a calculation yields that $n \ge \frac{C\sigma^2}{(\mu_1 - \mu_2)^2}$, for some numerical constant $C \ge 1$,
implies a small probability of error. We leave this calculation to the reader. $\diamond$
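The $n \asymp \sigma^2/(\mu_1 - \mu_2)^2$ scale is visible in simulation; the following sketch (ours, not from the text, assuming Python with numpy) estimates the summed error of the likelihood ratio test, which here reduces to thresholding the sample mean at the midpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma, trials = 0.0, 0.5, 1.0, 20_000
mid = (mu1 + mu2) / 2

for n in [1, 4, 16, 64]:
    xbar1 = mu1 + sigma * rng.standard_normal(trials) / np.sqrt(n)  # under P1
    xbar2 = mu2 + sigma * rng.standard_normal(trials) / np.sqrt(n)  # under P2
    err = np.mean(xbar1 > mid) + np.mean(xbar2 <= mid)
    print(n, err)  # error stays large until n >> sigma^2 / (mu2 - mu1)^2 = 4
```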
Now suppose that we wish to identify a random variable $X$ taking values in a finite set $\mathcal{X}$ from an
observation $Y$, constructing an estimate $\hat{X}$ of $X$ from $Y$ so that $X \to Y \to \hat{X}$ forms a Markov chain,
and we wish to provide lower bounds on the probability of error—that is, that $\hat{X} \ne X$. If we let
the function $h_2(p) = -p\log p - (1-p)\log(1-p)$ denote the binary entropy (entropy of a Bernoulli
random variable with parameter $p$), Fano's inequality takes the following form [e.g. 57, Chapter 2]:
Proposition 2.3.3 (Fano inequality). For any Markov chain $X \to Y \to \hat{X}$, we have
\[ h_2(P(\hat{X} \ne X)) + P(\hat{X} \ne X)\log(|\mathcal{X}| - 1) \ge H(X\mid\hat{X}). \tag{2.3.1} \]
Proof This proof follows by expanding an entropy functional in two different ways. Let $E$ be
the indicator for the event that $\hat{X} \ne X$, that is, $E = 1$ if $\hat{X} \ne X$ and $E = 0$ otherwise. Then we have
\begin{align*}
H(X, E\mid\hat{X}) &= H(X\mid E, \hat{X}) + H(E\mid\hat{X}) \\
&= P(E = 1) H(X\mid E = 1, \hat{X}) + \underbrace{P(E = 0) H(X\mid E = 0, \hat{X})}_{=0} + H(E\mid\hat{X}),
\end{align*}
where the zero follows because, given that there is no error, $X$ has no variability given $\hat{X}$. Expanding
the entropy by the chain rule in a different order, we have
\[ H(X, E\mid\hat{X}) = H(X\mid\hat{X}) + \underbrace{H(E\mid X, \hat{X})}_{=0}, \]
the second term zero because $E$ is a function of $X$ and $\hat{X}$. Combining the two expansions gives
\[ H(X\mid\hat{X}) = H(X, E\mid\hat{X}) = P(E = 1) H(X\mid E = 1, \hat{X}) + H(E\mid\hat{X}). \]
Noting that $H(E\mid\hat{X}) \le H(E) = h_2(P(E = 1))$, as conditioning reduces entropy, and that
$H(X\mid E = 1, \hat{X}) \le \log(|\mathcal{X}| - 1)$, as $X$ can take on at most $|\mathcal{X}| - 1$ values when there is an error,
gives the result.
In particular, when $X$ is uniform on $\mathcal{X}$, we obtain the following corollary.
Corollary 2.3.4. Assume that $X$ is uniform on the finite set $\mathcal{X}$ and that $X \to Y \to \hat{X}$ forms a Markov chain. Then
\[ P(\hat{X} \ne X) \ge 1 - \frac{I(X; Y) + \log 2}{\log(|\mathcal{X}|)}. \tag{2.3.2} \]
Proof Let $P_{\mathrm{error}} = P(X \ne \hat{X})$ denote the probability of error. Noting that $h_2(p) \le \log 2$ for any
$p \in [0, 1]$ (recall inequality (2.1.2), that is, that uniform random variables maximize entropy), we have
\[ \log 2 + P_{\mathrm{error}}\log(|\mathcal{X}|) \stackrel{(i)}{\ge} h_2(P_{\mathrm{error}}) + P_{\mathrm{error}}\log(|\mathcal{X}| - 1) \stackrel{(ii)}{\ge} H(X\mid\hat{X}) = H(X) - I(X; \hat{X}). \]
Here step (i) uses the bound $h_2(p) \le \log 2$, step (ii) uses Proposition 2.3.3, and the final equality is the
definition of mutual information, that is, $I(X; \hat{X}) = H(X) - H(X\mid\hat{X})$. The data processing inequality implies that $I(X; \hat{X}) \le I(X; Y)$,
and using $H(X) = \log(|\mathcal{X}|)$ completes the proof.
In particular, Corollary 2.3.4 shows that when $X$ is chosen uniformly at random and we observe
$Y$, we have
\[ \inf_{\Psi} P(\Psi(Y) \ne X) \ge 1 - \frac{I(X; Y) + \log 2}{\log|\mathcal{X}|}, \]
where the infimum is taken over all testing procedures $\Psi$. Some interpretation of this quantity
is helpful. If we think roughly of the number of bits it takes to describe a variable $X$ uniformly
chosen from $\mathcal{X}$, then we expect that $\log_2|\mathcal{X}|$ bits are necessary (and sufficient). Thus, until we
collect enough information that $I(X; Y) \approx \log|\mathcal{X}|$, so that $I(X; Y)/\log|\mathcal{X}| \approx 1$, we are unlikely to
be able to identify the variable $X$ with any substantial probability. In short, we must collect enough
bits to actually discover $X$.
Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most ⌈log2 m⌉ questions are necessary to guess the item
X that the chooser has picked—at each round of the game, the guesser asks a question that
eliminates half of the remaining possible items. Indeed, let us assume that m = 2l for some
l ∈ N; if not, the guesser can always make her task more difficult by increasing the size of X
until it is a power of 2. Thus, after k rounds, there are m2−k items left, and we have
k
1
m ≤ 1 if and only if k ≥ log2 m.
2
36
Lexture Notes on Statistics and Information Theory John Duchi
b ̸= X) ≥ 1 − I(X; Y1 , . . . , Yk ) + log 2
P(X .
log m
By the chain rule for mutual information, we have
k
X k
X k
X
I(X; Y1 , . . . , Yk ) = I(X; Yi | Y1:i−1 ) = H(Yi | Y1:i−1 ) − H(Yi | Y1:i−1 , X) ≤ H(Yi ).
i=1 i=1 i=1
As the answers Yi are yes/no, we have H(Yi) ≤ log 2, so that I(X; Y1:k) ≤ k log 2. Thus we find

P(X̂ ≠ X) ≥ 1 − (k + 1) log 2 / log m = (log2 m − 1 − k) / log2 m,

so that the guesser must ask at least k ≥ log2(m/2) = log2 m − 1 questions to be guaranteed that she will make no mistakes. 3
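The halving strategy is simply bisection, and a brief simulation makes the ⌈log2 m⌉ count concrete. The Python sketch below is ours (the helper play_twenty_questions is hypothetical, not from the notes) and assumes the universe consists of integers so that “is x in this half?” questions are easy to pose.

```python
import math

def play_twenty_questions(universe, target):
    """Halving (bisection) strategy: each yes/no question asks whether the
    chosen item lies in the lower half of the remaining candidates."""
    candidates = sorted(universe)
    questions = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        questions += 1
        if target in set(half):  # "is x in this half?"
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0], questions

universe = range(1000)  # m = 1000, so ceil(log2 m) = 10
guess, q = play_twenty_questions(universe, target=617)
print(guess, q, math.ceil(math.log2(len(universe))))  # 617 10 10
```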
While more useful (generally) than simply non-singular codes, uniquely decodable codes may require inspection of an entire string before recovering the first element. With that in mind, we now consider the easiest codes to use, which can always be decoded instantaneously. As is hopefully apparent from the definitions, all prefix/instantaneous codes are uniquely decodable, which are in turn non-singular. The converse is not true, though we will see a sense in which—as long as we care only about encoding sequences—using prefix instead of uniquely decodable codes has negligible consequences.

For example, written English, with periods (.) and spaces ( ) included at the ends of words (among other punctuation), is an instantaneous encoding of English into the symbols of the alphabet and punctuation, as punctuation symbols enforce that no “codeword” is a prefix of any other. A few concrete examples may make things clearer.
Example 2.4.1 (Encoding strategies): Consider the encoding schemes below, which encode
the letters a, b, c, and d.
Symbol C1 (x) C2 (x) C3 (x)
a 0 00 0
b 00 10 10
c 000 11 110
d 0000 110 111
By inspection, it is clear that C1 is non-singular but certainly not uniquely decodable (does
the sequence 0000 correspond to aaaa, bb, aab, aba, baa, ca, ac, or d?), while C3 is a prefix
code. We leave showing that C2 is uniquely decodable as an exercise. 3
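A quick computational check of Example 2.4.1 may be helpful. The Python sketch below (ours; the helper names are illustrative) computes the Kraft sum Σ_x d^{−ℓ(x)} for each code, tests the prefix condition, and brute-forces unique decodability over all source strings up to a fixed length—so the verdict on C2 is a finite-length check, not a proof.

```python
from itertools import product

codes = {
    "C1": {"a": "0", "b": "00", "c": "000", "d": "0000"},
    "C2": {"a": "00", "b": "10", "c": "11", "d": "110"},
    "C3": {"a": "0", "b": "10", "c": "110", "d": "111"},
}

def kraft_sum(code, d=2):
    return sum(d ** -len(w) for w in code.values())

def is_prefix_free(code):
    words = list(code.values())
    return not any(w1 != w2 and w2.startswith(w1) for w1 in words for w2 in words)

def uniquely_decodable_up_to(code, max_len=8):
    """Do any two distinct source strings of length <= max_len collide?"""
    seen = {}
    for n in range(1, max_len + 1):
        for symbols in product(code, repeat=n):
            enc = "".join(code[s] for s in symbols)
            if enc in seen and seen[enc] != symbols:
                return False
            seen.setdefault(enc, symbols)
    return True

for name, code in codes.items():
    print(name, f"Kraft sum = {kraft_sum(code):.3f},",
          "prefix-free," if is_prefix_free(code) else "not prefix-free,",
          "no collision up to length 8" if uniquely_decodable_up_to(code)
          else "collision found (not uniquely decodable)")
```

Running it shows C1 collides (the string 0000 has many parses), C3 is prefix-free, and C2 exhibits no collisions at these lengths, consistent with the claim that it is uniquely decodable.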
Theorem 2.4.2. Let X be a finite or countable set, and let ℓ : X → N be a function. If ℓ(x) is the length of the encoding of the symbol x in a uniquely decodable d-ary code, then

Σ_{x∈X} d^{−ℓ(x)} ≤ 1.   (2.4.1)

Conversely, given any function ℓ : X → N satisfying inequality (2.4.1), there is a prefix code whose codewords have length ℓ(x) for each x ∈ X.
Proof   We prove the first statement of the theorem by a counting and asymptotic argument. We begin by assuming that X is finite; we eliminate this assumption subsequently. As a consequence, there is some maximum length ℓmax such that ℓ(x) ≤ ℓmax for all x ∈ X. For a sequence x1, . . . , xn ∈ X, we have by the definition of our encoding strategy that ℓ(x1, . . . , xn) = Σ_{i=1}^n ℓ(xi). In addition, for each m we let

En(m) := {x1:n ∈ X^n | ℓ(x1:n) = m}

denote the sequences x1:n encoded with codewords of length m in our code; then as the code is uniquely decodable we certainly have card(En(m)) ≤ d^m for all n and m. Moreover, for all x1:n ∈ X^n we have ℓ(x1:n) ≤ nℓmax. We thus re-index the sum Σ_x d^{−ℓ(x)} and compute

(Σ_{x∈X} d^{−ℓ(x)})^n = Σ_{m=1}^{nℓmax} card(En(m)) d^{−m} ≤ Σ_{m=1}^{nℓmax} d^{m−m} = nℓmax.

Taking nth roots gives Σ_{x∈X} d^{−ℓ(x)} ≤ (nℓmax)^{1/n} → 1 as n → ∞, which is inequality (2.4.1) when X is finite. When X is countable, for each k ∈ N define Dk := Σ_{x : ℓ(x)≤k} d^{−ℓ(x)}; as each subset {x ∈ X : ℓ(x) ≤ k} is uniquely decodable, we have Dk ≤ 1 for all k by the finite case. Then 1 ≥ lim_{k→∞} Dk = Σ_{x∈X} d^{−ℓ(x)}.

[Figure 2.1: Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for x3 is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.]
The achievability of such a code intuitively follows by a pictorial argument (recall Figure 2.1),
so we first sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that ℓ(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
A more formal version implementing this sketch follows. Let ℓ be a length function satisfying Σ_{x∈X} d^{−ℓ(x)} ≤ 1. Identify X with N (or a subset thereof) in such a way that 1 ≤ ℓ(1) ≤ ℓ(2) ≤ . . ., i.e., ℓ(x) ≤ ℓ(y) whenever x < y, and let Xm = {x ∈ X | ℓ(x) = m} be the set of inputs with encoding length m. For each x ∈ N, define the value

v(x) = Σ_{i<x} d^{−ℓ(i)}.
We let the codeword C(x) for x be the first ℓ(x) terms in the d-ary expansion of v(x). Certainly the length of this encoding satisfies |C(x)| = ℓ(x). To see that it is prefix-free, take two symbols x < y, and assume for the sake of contradiction that C(x) is a prefix of C(y). Then the expansions of v(x) and v(y) agree on their first ℓ(x) terms, so that v(y) − v(x) < d^{−ℓ(x)}. But

v(y) − v(x) = Σ_{i<y} d^{−ℓ(i)} − Σ_{i<x} d^{−ℓ(i)} = Σ_{x≤i<y} d^{−ℓ(i)} = d^{−ℓ(x)} + Σ_{x<i<y} d^{−ℓ(i)} ≥ d^{−ℓ(x)},

a contradiction.
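The construction in this proof is completely algorithmic, and the following Python sketch (ours, not from the notes) implements it directly: sort the lengths, maintain v(x) = Σ_{i<x} d^{−ℓ(i)}, and emit the first ℓ(x) digits of the d-ary expansion of v(x).

```python
def kraft_prefix_code(lengths, d=2):
    """Construct a prefix code with the given codeword lengths, following the
    proof of Theorem 2.4.2: process symbols in order of increasing length,
    set v(x) = sum_{i<x} d^{-l(i)}, and take the first l(x) digits of the
    d-ary expansion of v(x). Requires sum_x d^{-l(x)} <= 1."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    assert sum(d ** -lengths[i] for i in order) <= 1 + 1e-12, "Kraft fails"
    code, v = {}, 0.0
    for i in order:
        digits, frac = [], v
        for _ in range(lengths[i]):  # d-ary expansion, truncated to l(i) digits
            frac *= d
            digit = int(frac)
            digits.append(str(digit))
            frac -= digit
        code[i] = "".join(digits)
        v += d ** -lengths[i]
    return code

print(kraft_prefix_code([2, 2, 2, 3, 4, 4]))
# {0: '00', 1: '01', 2: '10', 3: '110', 4: '1110', 5: '1111'}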
With the Kraft-McMillan theorem in place, we may directly relate the entropy of a random variable to the length of possible encodings for the variable; in particular, we show that the entropy is essentially the best possible code length of a uniquely decodable source code. In this theorem, we use the shorthand

Hd(X) := −Σ_{x∈X} p(x) logd p(x).
Theorem 2.4.3. Let X ∈ X be a discrete random variable distributed according to P and let ℓC be the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}∗. In addition, let C be the set of all uniquely decodable d-ary codes for X. Then

inf_{C∈C} EP[ℓC(X)] ≥ Hd(X),

and there exists a prefix code C with EP[ℓC(X)] ≤ Hd(X) + 1.
Proof   The lower bound is an argument by convex optimization, while for the upper bound we give an explicit length function and (implicit) prefix code attaining the bound. For the lower bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X|} (let m = |X| for shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions must satisfy the Kraft-McMillan inequality (2.4.1). Letting ℓi = ℓ(i), the minimal encoding length is at least

inf_{ℓ∈R^m} { Σ_{i=1}^m pi ℓi : Σ_{i=1}^m d^{−ℓi} ≤ 1 }.
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the Lagrangian for the preceding minimization problem as

L(ℓ, λ) = p⊤ℓ + λ(Σ_{i=1}^m d^{−ℓi} − 1), with ∇ℓ L(ℓ, λ) = p − λ[d^{−ℓi} log d]_{i=1}^m.
In particular, the optimal ℓ satisfies ℓi = logd(θ/pi) for some constant θ > 0, and solving Σ_{i=1}^m d^{−logd(θ/pi)} = Σ_{i=1}^m pi/θ = 1 gives θ = 1 and ℓ(i) = logd(1/pi), whence the minimal expected encoding length is at least Σ_{i=1}^m pi logd(1/pi) = Hd(X).

To attain the result, simply set our encoding to be ℓ(x) = ⌈logd(1/P(X = x))⌉, which satisfies the Kraft-McMillan inequality and thus yields a valid prefix code with

EP[ℓ(X)] = Σ_{x∈X} p(x)⌈logd(1/p(x))⌉ ≤ −Σ_{x∈X} p(x) logd p(x) + 1 = Hd(X) + 1,

as desired.
Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.
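A small numerical illustration of Theorem 2.4.3 (a sketch of ours, not part of the development): taking Shannon code lengths ℓ(x) = ⌈logd(1/p(x))⌉ for a dyadic distribution, the expected length matches the entropy exactly, and in general lies in [Hd(X), Hd(X) + 1].

```python
import math

def shannon_lengths(pmf, d=2):
    """Shannon code lengths ceil(log_d 1/p(x)); they satisfy the Kraft
    inequality, so a prefix code with these lengths exists."""
    # the epsilon guards against floating-point round-up at exact powers of d
    return {x: math.ceil(-math.log(p, d) - 1e-12) for x, p in pmf.items() if p > 0}

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(p * math.log2(p) for p in pmf.values())
lengths = shannon_lengths(pmf)
expected = sum(pmf[x] * lengths[x] for x in pmf)
print(f"H(X) = {H:.3f} bits, E[len] = {expected:.3f}")  # H <= E[len] <= H + 1
```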
In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in which it does, and we focus generally on a specific but common instance in which the limit does exist. First, we recall the definition of a stationary sequence of random variables.

Definition 2.5. We say a sequence X1, X2, . . . of random variables is stationary if for all n and all k ∈ N and all measurable sets A1, . . . , Ak ⊂ X we have

P(X1 ∈ A1, . . . , Xk ∈ Ak) = P(Xn+1 ∈ A1, . . . , Xn+k ∈ Ak).
Proposition 2.4.4. Let the sequence of random variables {Xi}, taking values in the discrete space X, be stationary. Then

H({Xi}) = lim_{n→∞} H(Xn | X1, . . . , Xn−1).
Proof   We begin by making the following standard observation of Cesàro means: if cn = (1/n)Σ_{i=1}^n ai and ai → a, then cn → a.3 Now, we note that for a stationary sequence, we have

H(Xn | X1:n−1) = H(Xn+1 | X2:n),

and using that conditioning decreases entropy, we have

H(Xn+1 | X1:n) ≤ H(Xn | X1:n−1).

Thus the sequence an := H(Xn | X1:n−1) is non-increasing and bounded below by 0, so that it has some limit lim_{n→∞} H(Xn | X1:n−1). As H(X1, . . . , Xn) = Σ_{i=1}^n H(Xi | X1:i−1) by the chain rule for entropy, applying the Cesàro-mean observation to the sequence an gives the result of the proposition.
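For a concrete instance of Proposition 2.4.4, the sketch below (ours; the chain parameters are arbitrary) computes H(X1:n)/n by brute-force enumeration for a stationary two-state Markov chain and compares it with H(X2 | X1), which for a Markov chain is the entropy-rate limit the proposition identifies.

```python
import itertools, math

# Stationary two-state Markov chain: P(1|0) = 0.2, P(0|1) = 0.4.
P = [[0.8, 0.2], [0.4, 0.6]]
pi = [2 / 3, 1 / 3]  # stationary distribution: pi = pi P

def joint_entropy(n):
    """H(X_1, ..., X_n) in bits, by enumerating all 2^n sequences."""
    H = 0.0
    for seq in itertools.product([0, 1], repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        H -= p * math.log2(p)
    return H

rate = -sum(pi[a] * P[a][b] * math.log2(P[a][b]) for a in (0, 1) for b in (0, 1))
for n in (1, 2, 4, 8, 12):
    print(f"n={n:2d}  H(X_1:n)/n = {joint_entropy(n)/n:.4f}   "
          f"(entropy rate = {rate:.4f})")
```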
Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable Xi. To do so, we require the use of a block code, which (while it may be a prefix code) treats sets of random variables (X1, . . . , Xm) ∈ X^m as a single symbol to be jointly encoded.
Proposition 2.4.5. Let the sequence of random variables X1, X2, . . . be stationary. Then for any ϵ > 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X^m → {0, . . . , d − 1}∗ such that

lim_n (1/n) EP[ℓC(X1:n)] ≤ H({Xi}) + ϵ = lim_n H(Xn | X1, . . . , Xn−1) + ϵ.
Proof   Let C : X^m → {0, 1, . . . , d − 1}∗ be any prefix code with

ℓC(x1:m) ≤ logd(1/P(X1:m = x1:m)) + 1,

which exists by the Kraft-McMillan theorem upon taking ℓ(x1:m) = ⌈logd(1/P(X1:m = x1:m))⌉. Then whenever n/m is an integer, we have

EP[ℓC(X1:n)] = Σ_{i=1}^{n/m} EP[ℓC(X_{m(i−1)+1}, . . . , X_{mi})] ≤ Σ_{i=1}^{n/m} [H(X_{m(i−1)+1}, . . . , X_{mi}) + 1] = n/m + (n/m) H(X1, . . . , Xm),

where the final equality uses the stationarity of the sequence. Dividing by n gives the result upon taking m suitably large that 1/m + (1/m)H(X1, . . . , Xm) ≤ ϵ + H({Xi}).
Note that if m does not divide n, we may also encode the length of the sequence of encoded words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols, while if it begins with a 1, then the next ⌈logd m⌉ symbols encode the length of the block. This would yield an increase in the expected length of the code to

EP[ℓC(X1:n)] ≤ (2n + ⌈logd m⌉)/m + (n/m) H(X1, . . . , Xm).

Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
3 Indeed, let ϵ > 0 and take N such that n ≥ N implies that |ai − a| < ϵ. Then for n ≥ N, we have

cn − a = (1/n)Σ_{i=1}^n (ai − a) = N(cN − a)/n + (1/n)Σ_{i=N+1}^n (ai − a) ∈ N(cN − a)/n ± ϵ.

Taking n → ∞ yields that the term N(cN − a)/n → 0, which gives that cn − a ∈ [−ϵ, ϵ] eventually for any ϵ > 0, which is our desired result.
2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an essen-
tially complete treatment in Chapter 2 of their book [57]. Gray [104] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chap-
ter 7 in the newer edition of the same book). Csiszár and Körner [61] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [6] and Csiszár [59], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [137]
provide a survey of f -divergences and their relationships with different statistical concepts (taking a
Bayesian point of view), and various authors have extended the pairwise divergence measures to di-
vergence measures between multiple distributions [107], making connections to experimental design
and classification [98, 76], which we investigate later in the book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [134, 182]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [76, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [138].
JCD Comment: Converse to Kraft is Chaitin?
2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary distributions on a space X. The total variation distance between P1 and P2 is defined as

∥P1 − P2∥TV := sup_{A⊂X} |P1(A) − P2(A)|.
Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2 have densities p1 and p2 on a (common) set X.

(a) 2∥P1 − P2∥TV = ∫ |p1(x) − p2(x)| dx.

(b) For functions f : X → R, define the supremum norm ∥f∥∞ = sup_{x∈X} |f(x)|. Show that 2∥P1 − P2∥TV = sup_{∥f∥∞≤1} ∫_X f(x)(p1(x) − p2(x)) dx.

(c) ∥P1 − P2∥TV = ∫ max{p1(x), p2(x)} dx − 1.

(d) ∥P1 − P2∥TV = 1 − ∫ min{p1(x), p2(x)} dx.
Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be N(θ1, Σ) and P2 be N(θ2, Σ), where Σ ≻ 0 is a positive definite matrix.

(a) Show that Dkl(P1||P2) = ½(θ1 − θ2)⊤Σ⁻¹(θ1 − θ2).

(b) Show that d²hel(P1, P2) = 1 − exp(−⅛(θ1 − θ2)⊤Σ⁻¹(θ1 − θ2)).
Exercise 2.3 (The optimal test between distributions): Prove Le Cam's inequality: for any function ψ with dom ψ ⊃ X and any distributions P1, P2,

P1(ψ(X) ≠ 1) + P2(ψ(X) ≠ 2) ≥ 1 − ∥P1 − P2∥TV.

Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample X we must decide whether P1 or P2 is more likely, has value at least 1 − ∥P1 − P2∥TV. Given P1 and P2, is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density p(x) = (λ/2)exp(−λ|x − µ|). Consider the hypothesis test of P1 versus P2, where X has distribution Laplace(λ, µ1) under P1 and distribution Laplace(λ, µ2) under P2, where µ1 < µ2. Show that the minimal value over all tests ψ of P1 versus P2 is

inf_ψ {P1(ψ(X) ≠ 1) + P2(ψ(X) ≠ 2)} = exp(−(λ/2)|µ1 − µ2|).
Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:
(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).
(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)
(b) Generalizing the preceding result, let a : X → R+ and b : X → R+, and let µ be a finite measure on X with respect to which a is integrable. Show that

(∫ a(x)dµ(x)) f( ∫ b(x)dµ(x) / ∫ a(x)dµ(x) ) ≤ ∫ a(x) f(b(x)/a(x)) dµ(x).

If you are unfamiliar with measure theory, prove the following essentially equivalent result: let u : X → R+ satisfy ∫ u(x)dx < ∞. Show that

(∫ a(x)u(x)dx) f( ∫ b(x)u(x)dx / ∫ a(x)u(x)dx ) ≤ ∫ a(x) f(b(x)/a(x)) u(x)dx

whenever ∫ a(x)u(x)dx < ∞. (It is possible to demonstrate this remains true under appropriate limits even when ∫ a(x)u(x)dx = +∞, but it is a mess.)

(Hint: use the fact that the perspective of a function f, defined by h(x, t) = tf(x/t) for t > 0, is jointly convex in x and t; see Proposition B.3.12.)
Exercise 2.9 (Data processing and f-divergences I): As with the KL-divergence, given a quantizer g of the set X, where g induces a partition A1, . . . , Am of X, we define the f-divergence between P and Q conditioned on g as

Df(P||Q | g) := Σ_{i=1}^m Q(Ai) f(P(Ai)/Q(Ai)) = Σ_{i=1}^m Q(g⁻¹({i})) f(P(g⁻¹({i}))/Q(g⁻¹({i}))).
Given quantizers g1 and g2, we say that g1 is a finer quantizer than g2 under the following condition: assume that g1 induces the partition A1, . . . , An and g2 induces the partition B1, . . . , Bm; then for each of the sets Bi, there exist some k and sets Ai1, . . . , Aik such that Bi = ∪_{j=1}^k Aij. We let g1 ≺ g2 denote that g1 is a finer quantizer than g2.
(a) Let g1 and g2 be quantizers of the set X, and let g1 ≺ g2, meaning that g1 is a finer quantization than g2. Prove that

Df(P||Q | g2) ≤ Df(P||Q | g1).

Equivalently, show that whenever A and B are collections of sets partitioning X, but A is a finer partition of X than B, then

Σ_{B∈B} Q(B) f(P(B)/Q(B)) ≤ Σ_{A∈A} Q(A) f(P(A)/Q(A)).
(b) Suppose that X is countable (or finite) so that P and Q have p.m.f.s p and q. Show that

Df(P||Q) = Σ_x q(x) f(p(x)/q(x)),

where on the left we are using the partition definition (2.2.3); you should show that the partition into discrete parts of X achieves the supremum. You may assume that X is finite. (Though feel free to prove the result in the case that X is infinite.)
Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
R R
(a) Define the marginals KP (A) = K(A, x)p(x)dx and KQ (A) = K(A, x)q(x)dx. Show that
Df (KP ||KQ ) ≤ Df (P ||Q) .
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also
recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY .
Define the f -information between X and Y as
If (X; Y ) := Df (PXY ||PX × PY ) .
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).
Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a common space X. Prove the following variational representation, known as the Donsker-Varadhan theorem, of the KL divergence:

Dkl(P||Q) = sup_{f : EQ[e^{f(X)}]<∞} { EP[f(X)] − log EQ[exp(f(X))] }.
Exercise 2.14: Let P1 be N(θ1 , Σ1 ) and P2 be N(θ2 , Σ2 ), where Σi ≻ 0 are positive definite
matrices. Give Dkl (P1 ||P2 ).
Exercise 2.15: Let {Pv}v∈V be an arbitrary collection of distributions on a space X and µ be a probability measure on V. Show that if V ∼ µ and, conditional on V = v, we draw X ∼ Pv, then

(a) I(X; V) = ∫ Dkl(Pv||P̄) dµ(v), where P̄ = ∫ Pv dµ(v) is the (weighted) average of the Pv. You may assume that V is discrete if you like.

(b) For any distribution Q on X, I(X; V) = ∫ Dkl(Pv||Q) dµ(v) − Dkl(P̄||Q). Conclude that I(X; V) ≤ ∫ Dkl(Pv||Q) dµ(v), or, equivalently, that P̄ minimizes ∫ Dkl(Pv||Q) dµ(v) over all probabilities Q.
Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions on X1^n = (X1, . . . , Xn) ∈ X^n, and let Pi(· | x1^{i−1}) be the conditional distribution of Xi given X1^{i−1} = x1^{i−1} (and similarly for Qi). Show that

∥P − Q∥TV ≤ Σ_{i=1}^n EP[ ∥Pi(· | X1^{i−1}) − Qi(· | X1^{i−1})∥TV ],

where the expectation is taken over X1^{i−1} ∼ P.
(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that
Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik Endres and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ Pv, then √(I(X; V)) is a metric on distributions. (Said differently, Djs(P||Q)^{1/2} is a metric on distributions, and it generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.
Chapter 3
Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.
In the continuous case, pθ is instead a density on X ⊂ R^k, and pθ takes the identical form above but with

A(θ) = log ∫_X h(x) exp(⟨θ, ϕ(x)⟩) dx.
We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.4.)
With our notation, we have the following definition.
Definition 3.1. The exponential family associated with the function ϕ and base measure µ is defined as the set of distributions with densities pθ with respect to µ, where

pθ(x) = exp(⟨θ, ϕ(x)⟩ − A(θ)), A(θ) = log ∫ exp(⟨θ, ϕ(x)⟩) dµ(x),

and the natural parameter space

Θ := {θ | A(θ) < ∞}

is open.
In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation
In some scenarios, it may be convenient to re-parameterize the problem in terms of some function η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are most convenient.
We now give a few examples of exponential family models.
Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually parameterized by some λ > 0, and for x ∈ N has distribution Pλ(X = x) = (1/x!)λ^x e^{−λ}. Thus by taking µ to be counting (discrete) measure on {0, 1, . . .} and setting θ = log λ, we find the density (probability mass function in this case)

p(x) = (1/x!)λ^x e^{−λ} = (1/x!)exp(x log λ − λ) = (1/x!)exp(xθ − e^θ).

Notably, taking h(x) = (x!)⁻¹ and log-partition A(θ) = e^θ, we have probability mass function pθ(x) = h(x)exp(θx − A(θ)). 3
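A two-line numerical check of Example 3.1.2 (ours, purely illustrative) confirms that the exponential family form with θ = log λ, h(x) = 1/x!, and A(θ) = e^θ reproduces the Poisson probability mass function:

```python
import math

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expfam_pmf(x, theta):
    # h(x) exp(theta * x - A(theta)) with h(x) = 1/x!, A(theta) = e^theta
    return math.exp(theta * x - math.exp(theta)) / math.factorial(x)

lam = 3.7
theta = math.log(lam)
for x in range(6):
    assert abs(poisson_pmf(x, lam) - expfam_pmf(x, theta)) < 1e-12
print("Poisson pmf matches its exponential-family form for theta = log(lambda)")
```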
Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal distribution, we take µ to be Lebesgue measure on R^d. If we fix the covariance and vary only the mean µ in the family N(µ, Σ), then X ∼ N(µ, Σ) has density

pµ(x) = exp(−½(x − µ)⊤Σ⁻¹(x − µ) − ½ log det(2πΣ)).

In particular, taking the natural parameter θ = Σ⁻¹µ, we have carrier h(x) = exp(−½x⊤Σ⁻¹x − ½ log det(2πΣ)), sufficient statistic ϕ(x) = x, and log partition A(θ) = ½θ⊤Σθ. 3
Example 3.1.4 (Normal distribution): Let X ∼ N(µ, Σ). We may re-parameterize this via Θ = Σ⁻¹ and θ = Σ⁻¹µ, and we have density

pθ,Θ(x) ∝ exp( ⟨θ, x⟩ − ½⟨xx⊤, Θ⟩ ),

where ⟨·, ·⟩ denotes the Euclidean inner product. See Exercise 3.1. 3
In some cases, it is analytically convenient to include a few more conditions on the exponential
family.
Definition 3.2. Let {Pθ}θ∈Θ be an exponential family as in Definition 3.1. The sufficient statistic ϕ is minimal if Θ = dom A ⊂ R^d is full-dimensional and there exists no nonzero vector u ∈ R^d such that ⟨u, ϕ(x)⟩ is constant for µ-almost every x ∈ X.
pendent components when viewed as vectors [ϕi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [41, Chapter 1]).
Here, we enumerate a few of their key analytical properties, focusing on the cumulant generating (or log partition) function A(θ) = log ∫ e^{⟨θ,ϕ(x)⟩} dµ(x). We begin with a heuristic calculation, where we assume that we may exchange differentiation and integration. Assuming that this is the case, we then obtain the important expectation relationship

∇A(θ) = (1 / ∫ e^{⟨θ,ϕ(x)⟩} dµ(x)) ∫ ∇θ e^{⟨θ,ϕ(x)⟩} dµ(x) = e^{−A(θ)} ∫ ϕ(x) e^{⟨θ,ϕ(x)⟩} dµ(x) = ∫ ϕ(x) e^{⟨θ,ϕ(x)⟩−A(θ)} dµ(x) = Eθ[ϕ(X)],

because e^{⟨θ,ϕ(x)⟩−A(θ)} = pθ(x). A completely similar (and still heuristic, at least at this point) calculation gives the covariance identity

∇²A(θ) = Covθ(ϕ(X)).
That these identities hold is no accident and is central to the appeal of exponential family models.
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that ∇2 A(θ) = Covθ (ϕ(X)) ⪰ 0 makes convexity of A immediate, one can also provide a
direct argument without appealing to differentiation.
Proposition 3.2.1. The cumulant-generating function θ ↦ A(θ) is convex, and it is strictly convex if and only if Covθ(ϕ(X)) is positive definite for all θ ∈ dom A.

Proof   Let θλ = λθ1 + (1 − λ)θ2, where θ1, θ2 ∈ Θ and λ ∈ (0, 1). Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder's inequality implies

log ∫ exp(⟨θλ, ϕ(x)⟩) dµ(x) = log ∫ exp(⟨θ1, ϕ(x)⟩)^λ exp(⟨θ2, ϕ(x)⟩)^{1−λ} dµ(x)
≤ log [ (∫ exp(⟨θ1, ϕ(x)⟩) dµ(x))^λ (∫ exp(⟨θ2, ϕ(x)⟩) dµ(x))^{1−λ} ]
= λ log ∫ exp(⟨θ1, ϕ(x)⟩) dµ(x) + (1 − λ) log ∫ exp(⟨θ2, ϕ(x)⟩) dµ(x),

as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we formally show that ∇²A(θ) = Covθ(ϕ(X)).
We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of the sufficient statistics ϕ(x). To describe the properties, we provide a bit of notation related to tensor products: for a vector x ∈ R^d, we let

x^⊗k := x ⊗ x ⊗ · · · ⊗ x   (k times)

denote the kth order tensor, or multilinear operator, that for v1, . . . , vk ∈ R^d satisfies

x^⊗k(v1, . . . , vk) := ⟨x, v1⟩ · · · ⟨x, vk⟩ = Π_{i=1}^k ⟨x, vi⟩.
When k = 2, this is the familiar outer product x^⊗2 = xx⊤. (More generally, one may think of x^⊗k as a d × d × · · · × d box, where the (i1, . . . , ik) entry is [x^⊗k]_{i1,...,ik} = x_{i1} · · · x_{ik}.) With this notation, our first key result regards the differentiability of A, where we can compute (all) derivatives of e^{A(θ)} by interchanging integration and differentiation.

Proposition 3.2.2. Let M(θ) := e^{A(θ)} = ∫ e^{⟨θ,ϕ(x)⟩} dµ(x). Then M is infinitely differentiable on the interior of Θ, with

∇^k M(θ) = ∫ ϕ(x)^⊗k e^{⟨θ,ϕ(x)⟩} dµ(x).

The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation and covariance relationships

∇A(θ) = (1 / ∫ e^{⟨θ,ϕ(x)⟩} dµ(x)) ∫ ∇e^{⟨θ,ϕ(x)⟩} dµ(x) = ∫ ϕ(x) pθ(x) dµ(x) = Eθ[ϕ(X)]

and

∇²A(θ) = (1 / ∫ e^{⟨θ,ϕ(x)⟩} dµ(x)) ∫ ϕ(x)^⊗2 e^{⟨θ,ϕ(x)⟩} dµ(x) − (∫ ϕ(x) e^{⟨θ,ϕ(x)⟩} dµ(x))^⊗2 / (∫ e^{⟨θ,ϕ(x)⟩} dµ(x))²
= Eθ[ϕ(X)ϕ(X)⊤] − Eθ[ϕ(X)]Eθ[ϕ(X)]⊤
= Covθ(ϕ(X)).
Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties. Recall that A is strictly convex if A(λθ1 + (1 − λ)θ2) < λA(θ1) + (1 − λ)A(θ2) for all θ1 ≠ θ2 ∈ dom A and λ ∈ (0, 1).
Proposition 3.2.3. Let {Pθ} be a regular exponential family. The log partition function A is strictly convex if and only if {Pθ} is minimal.

Proof   If the family is minimal, then Varθ(u⊤ϕ(X)) > 0 for any nonzero vector u, while Varθ(u⊤ϕ(X)) = u⊤∇²A(θ)u. This implies the strict positive definiteness ∇²A(θ) ≻ 0, which is equivalent to strict convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if A is strictly convex, then ∇²A(θ) ≻ 0 for all θ ∈ Θ, so that Varθ(u⊤ϕ(X)) > 0 for all u ≠ 0 and hence u⊤ϕ(x) is non-constant in x; that is, the family is minimal.
To fit an exponential family model to a population P, a natural strategy is to minimize the expected negative log likelihood, which (up to an additive constant independent of θ) is

minimize_θ  EP[−log pθ(X)] = EP[−⟨θ, ϕ(X)⟩ + A(θ)].   (3.2.1)

This is always a convex optimization problem (see Appendices B and C for much more on this), as A is convex and the first term is linear, and so it has no non-global optima. Here and throughout, as we mention in the introductory remarks to this chapter, we treat convex optimization as a technology: as long as the dimension of a problem is not too large and its objective can be evaluated, it is (essentially) computationally trivial.

Of course, we never have access to the population P fully; instead, we receive a sample X1, . . . , Xn from P. In this case, a natural approach is to replace the expected (negative) log likelihood above with its empirical version and solve
minimize_θ  −Σ_{i=1}^n log pθ(Xi) = Σ_{i=1}^n [−⟨θ, ϕ(Xi)⟩ + A(θ)],   (3.2.2)
which is still a convex optimization problem (as the objective is convex in θ). The maximum likelihood estimate is any vector θ̂n minimizing the negative log likelihood (3.2.2), which by setting gradients to 0 is evidently any vector satisfying

∇A(θ̂n) = E_{θ̂n}[ϕ(X)] = (1/n)Σ_{i=1}^n ϕ(Xi).   (3.2.3)
In particular, we need only find a parameter θ̂n matching moments of the empirical distribution of the observed Xi ∼ P. This θ̂n is unique whenever Covθ(ϕ(X)) ≻ 0 for all θ, that is, when the covariance of ϕ is full rank in the exponential family model, because then the objective in the minimization problem (3.2.2) is strictly convex.
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the estimator θ̂n; the next paragraph assumes comfort with some classical asymptotic statistics (and the central limit theorem) and is not essential for what comes later. Then we can see how minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data
Xi are i.i.d. from an exponential family model Pθ⋆. Then we expect that the maximum likelihood estimate θ̂n should converge to θ⋆, and so

(1/n)Σ_{i=1}^n ϕ(Xi) = ∇A(θ̂n) = ∇A(θ⋆) + (∇²A(θ⋆) + o(1))(θ̂n − θ⋆).

But of course, ∇A(θ⋆) = Eθ⋆[ϕ(X)], and so the central limit theorem gives that

(1/n)Σ_{i=1}^n (ϕ(Xi) − ∇A(θ⋆)) ∼· N(0, n⁻¹Covθ⋆(ϕ(X))) = N(0, n⁻¹∇²A(θ⋆)),
where ∼· means “is approximately distributed as.” Multiplying by (∇²A(θ⋆) + o(1))⁻¹ ≈ ∇²A(θ⋆)⁻¹, we thus see (still working in our heuristic)

θ̂n − θ⋆ = (∇²A(θ⋆) + o(1))⁻¹ (1/n)Σ_{i=1}^n (ϕ(Xi) − ∇A(θ⋆)) ∼· N(0, n⁻¹ · ∇²A(θ⋆)⁻¹),   (3.2.4)

where we use that BZ ∼ N(0, BΣB⊤) if Z ∼ N(0, Σ). (It is possible to make each of these steps fully rigorous.) Thus the cumulant generating function A governs the error we expect in θ̂n − θ⋆.
Much of the rest of this book explores properties of these types of minimization problems: at what rates do we expect θ̂n to converge to a global minimizer of problem (3.2.1)? Can we show that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential families form a particular working example to motivate this development.
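As a sanity check on the heuristic (3.2.4), the following Monte Carlo sketch (ours; the sampler and constants are arbitrary choices) uses the Poisson family of Example 3.1.2, where ∇A(θ) = e^θ so the maximum likelihood estimate is θ̂n = log(X̄n), and compares the empirical variance of θ̂n with the prediction n⁻¹∇²A(θ⋆)⁻¹ = 1/(nλ).

```python
import math, random, statistics

random.seed(0)
theta_star, n, trials = math.log(4.0), 500, 2000
lam = math.exp(theta_star)

def poisson_sample(lam):
    # Knuth's method; fine for moderate lambda
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

estimates = []
for _ in range(trials):
    xbar = sum(poisson_sample(lam) for _ in range(n)) / n
    estimates.append(math.log(xbar))  # MLE for the natural parameter

print(f"empirical Var(theta_hat) = {statistics.variance(estimates):.2e}, "
      f"predicted 1/(n*lambda) = {1 / (n * lam):.2e}")
```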
A direct computation relates KL divergences within an exponential family to A: we have

Dkl(Pθ||Pθ+∆) = Eθ[⟨θ, ϕ(X)⟩ − A(θ) − ⟨θ + ∆, ϕ(X)⟩ + A(θ + ∆)] = A(θ + ∆) − A(θ) − ⟨∇A(θ), ∆⟩,

and similarly

Dkl(Pθ+∆||Pθ) = Eθ+∆[⟨θ + ∆, ϕ(X)⟩ − A(θ + ∆) − ⟨θ, ϕ(X)⟩ + A(θ)] = A(θ) − A(θ + ∆) + ⟨∇A(θ + ∆), ∆⟩.
These identities give an immediate connection with convexity. Indeed, for a differentiable convex function h, the Bregman divergence associated with h is

Dh(w, w′) := h(w) − h(w′) − ⟨∇h(w′), w − w′⟩,   (3.3.1)
which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. One might more accurately call the quantity (3.3.1) the “first-
order divergence,” which is more evocative, but the statistical, machine learning, and optimization
literatures—in which such divergences frequently appear—have adopted this terminology, so we
stick with it.
JCD Comment: Put in a picture of a Bregman divergence
Proposition 3.3.1. Let {Pθ} be an exponential family model with cumulant generating function A(θ). Then

Dkl(Pθ||Pθ+∆) = DA(θ + ∆, θ) and Dkl(Pθ+∆||Pθ) = DA(θ, θ + ∆).

When the perturbation ∆ is small, that A is infinitely differentiable then gives that

Dkl(Pθ||Pθ+∆) = ½∆⊤∇²A(θ)∆ + O(∥∆∥³),

so that the Hessian ∇²A(θ) tells quite precisely how the KL divergence changes as θ varies (locally). As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between two distributions is small, it is hard to test between them, and in the sequel, we will show converses to this. The Hessian ∇²A(θ⋆) also governs the error in the estimate θ̂n − θ⋆ in our heuristic (3.2.4). When the Hessian ∇²A(θ) is large in the semidefinite order, the KL divergence Dkl(Pθ||Pθ+∆) is large, and the asymptotic covariance (3.2.4) is small. For this—and other reasons we address later—for exponential family models, we call

Iθ := ∇²A(θ) = Covθ(ϕ(X))   (3.3.2)

the Fisher information of the model at θ.
In a generalized linear model, we have a sufficient statistic ϕ : X × Y → R^d, and we model Y | X = x via the conditional exponential family with density or probability mass function

pθ(y | x) = exp( ϕ(x, y)⊤θ − A(θ | x) ) h(y),   (3.4.1)

where A(θ | x) = log ∫ h(y) exp(ϕ(x, y)⊤θ) dy (the integral a sum when Y is discrete) normalizes the conditional density.
The log partition function A(· | x) provides the same insights for the conditional models (3.4.1) as it does for the unconditional exponential family models in the preceding sections. Indeed, as in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C∞ on its domain and convex. Moreover, it gives the expected moments of the sufficient statistic ϕ conditional on x, as

∇A(θ | x) = Eθ[ϕ(X, Y) | X = x]

and

∇²A(θ | x) = Covθ(ϕ(X, Y) | X = x),

from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear
regression, binary logistic regression, and multiclass logistic regression. We give these three, as
well as describing two more important examples involving modeling count data through Poisson
regression and making predictions for targets y known to live in a bounded set.
Example 3.4.1 (Linear regression): In the classical linear regression model, we posit Y | X = x ∼ N(θ⊤x, σ²) for a known variance σ² > 0, so that

pθ(y | x) = exp( yx⊤θ/σ² − (θ⊤x)²/(2σ²) ) · exp( −y²/(2σ²) − ½ log(2πσ²) ),

and we have the exponential family representation (3.4.1) with ϕ(x, y) = (1/σ²)xy, h(y) = exp(−y²/(2σ²) − ½ log(2πσ²)), and A(θ | x) = (1/(2σ²))θ⊤xx⊤θ. As ∇A(θ | x) = Eθ[ϕ(X, Y) | X = x] = (1/σ²)x Eθ[Y | X = x], we easily recover Eθ[Y | X = x] = θ⊤x. 3
Frequently, we wish to predict binary or multiclass random variables Y. For example, consider a medical application in which we wish to assess the probability that, based on a set of covariates x ∈ R^d (say, blood pressure, height, weight, family history), an individual will have a heart attack in the next 5 years, so that Y = 1 indicates heart attack and Y = −1 indicates not. The next example shows how we might model this.
Here, the idea is that if θy⊤ x > θj⊤ x for all j ̸= y, then the model assigns higher probability to
class y than any other class; the larger the gap between θy⊤ x and θj⊤ x, the larger the difference
in assigned probabilities. 3
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When Y ∈ N is a count, the Poisson distribution with rate λ > 0 gives P(Y = y) = e^{−λ}λ^y/y!. Poisson regression models λ via e^{θ⊤x}, giving model

pθ(y | x) = (1/y!) exp( yx⊤θ − e^{θ⊤x} ),

so that we have carrier h(y) = 1/y! and the simple sufficient statistic ϕ(x, y) = yx. The log partition function is A(θ | x) = e^{θ⊤x}. 3
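To illustrate that fitting such a model is the minimization of a smooth convex function, here is a minimal sketch (ours, assuming NumPy is available; the synthetic data and step size are arbitrary) that fits Example 3.4.4 by gradient descent on the empirical negative log likelihood:

```python
import numpy as np

# Gradient descent on L(theta) = (1/n) sum_i [exp(x_i' theta) - y_i x_i' theta],
# the (convex) Poisson regression negative log likelihood up to constants.
rng = np.random.default_rng(0)
n, d = 2000, 3
theta_true = np.array([0.5, -0.3, 0.2])
X = rng.normal(size=(n, d))
y = rng.poisson(np.exp(X @ theta_true))

theta, lr = np.zeros(d), 0.1
for _ in range(500):
    # gradient = (1/n) sum_i (grad A(theta | x_i) - phi(x_i, y_i))
    grad = X.T @ (np.exp(X @ theta) - y) / n
    theta -= lr * grad

print("true:", theta_true, "estimate:", theta.round(3))
```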
Lastly, we consider a less standard example, but which highlights the flexibility of these models.
Here, we assume a linear regression problem but in which we wish to predict values Y in a bounded
range.
Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish to model it via an exponential family model with density

pθ(y | x) = exp( θ⊤x · y − A(θ | x) ) for y ∈ [−b, b].

Because ∫_{−b}^{b} e^{sy} dy = (e^{sb} − e^{−sb})/s, where the limit as s → 0 is 2b, the (conditional) log partition function is thus

A(θ | x) = log( (e^{bθ⊤x} − e^{−bθ⊤x}) / (θ⊤x) ) if θ⊤x ≠ 0, and A(θ | x) = log(2b) otherwise.

While its functional form makes this highly non-obvious, our general results guarantee that A(θ | x) is indeed C∞ and convex in θ. We have ∇A(θ | x) = xEθ[Y | X = x] because ϕ(x, y) = xy, and we can therefore immediately recover Eθ[Y | X = x]. Indeed, set s = θ⊤x, and without loss of generality assume s ≠ 0. Then

Eθ[Y | X = x] = (d/ds) log( (e^{sb} − e^{−sb})/s ) = b(e^{sb} + e^{−sb})/(e^{sb} − e^{−sb}) − 1/s ∈ (−b, b). 3
To fit a generalized linear model to data (Xi, Yi), i = 1, . . . , n, we again minimize the empirical negative log likelihood,

minimize_θ  Σ_{i=1}^n [−⟨θ, ϕ(Xi, Yi)⟩ + A(θ | Xi)].

This is a convex optimization problem with C∞ objective, so we can treat solving it as an (essentially) trivial problem unless the sample size n or dimension d of θ are astronomically large.
As in the moment matching equality (3.2.3), a necessary and sufficient condition for θ̂n to minimize the above objective is that it achieves 0 gradient, that is,

(1/n)Σ_{i=1}^n ∇A(θ̂n | Xi) = (1/n)Σ_{i=1}^n ϕ(Xi, Yi).

Once again, to find θ̂n amounts to matching moments, as ∇A(θ | Xi) = Eθ[ϕ(X, Y) | X = Xi], and we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless,
we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In
particular, for the logarithmic loss ℓ(θ; x, y) = −log pθ(y | x), we can define the empirical loss

Ln(θ) := (1/n)Σ_{i=1}^n ℓ(θ; Xi, Yi).
Then, as n → ∞, we expect that Ln (θ) → E[ℓ(θ; X, Y )], so that the minimizing θ should give the
best predictions possible according to the loss ℓ. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like Ln ) from their population
counterparts.
When the covariates X follow a marginal distribution Q and Pθ ◦ Q denotes the joint distribution of (X, Y) under the model (3.4.1), the computation above extends to give

Dkl(Pθ ◦ Q||Pθ+∆ ◦ Q) = Eθ[DA(·|X)(θ + ∆, θ)], and similarly Dkl(Pθ+∆ ◦ Q||Pθ ◦ Q) = Eθ+∆[DA(·|X)(θ, θ + ∆)],

where we recall the Bregman divergence (3.3.1) and have used that Eθ[ϕ(X, Y) | X] = ∇A(θ | X).
Performing a Taylor expansion, we have

A(θ + ∆ | x) = A(θ | x) + ⟨∇A(θ | x), ∆⟩ + ½∆⊤∇²A(θ | x)∆ + O( Eθ+t∆[∥ϕ(X, Y)∥³ | x] · ∥∆∥³ ),

where we have computed third derivatives of A(θ | x), and t ∈ [0, 1]. Evaluating the Taylor expansion in the integral form, once there exists some δ > 0 such that

∫₀¹ EQ[ Eθ+tv[∥ϕ(X, Y)∥³ | X] ] dt < ∞

for any ∥v∥₂ ≤ δ, for either of the expansions above we have the following corollary:
for any ∥v∥2 ≤ δ, for either of the expansions above we have the following corollary:
Corollary 3.4.6. Assume the marginal distribution Q on X satisfies the above integrability condition. Then as ∆ → 0,

Dkl(Pθ ◦ Q||Pθ+∆ ◦ Q) = ½∆⊤EQ[∇²A(θ | X)]∆ + O(∥∆∥³)

and

Dkl(Pθ+∆ ◦ Q||Pθ ◦ Q) = ½∆⊤EQ[∇²A(θ | X)]∆ + O(∥∆∥³).
In analogy with Proposition 3.3.1, we see again that the expected Hessian EQ [∇2 A(θ | X)] tells
quite precisely how the KL divergence changes as θ varies locally, but now, the distribution Q on
X also enters the picture. So when A and the distribution Q are such that EQ [∇2 A(θ | X)] is large
in the semidefinite order, then it is easy to distinguish data coming from Pθ ◦ Q from that drawn
from Pθ′ ◦ Q, and otherwise, it is not. We therefore call
E[∇²A(θ | X)] = E[Covθ(ϕ(X, Y) | X)] = E[∇log pθ(Y | X) ∇log pθ(Y | X)⊤]   (3.4.2)

the Fisher information of the conditional model.
Example 3.4.7 (The information in logistic regression): For the binary logistic regression model (Example 3.4.2) with Y ∈ {0, 1}, we have

∇log pθ(y | x) = yx − (e^{x⊤θ}/(1 + e^{x⊤θ}))x = (y − pθ(1 | x))x,

so that, as E[(Y − pθ(1 | x))² | X = x] = pθ(1 | x)(1 − pθ(1 | x)), the Fisher information (3.4.2) is E[pθ(1 | X)(1 − pθ(1 | X))XX⊤]. 3
Example 3.4.8 (The KL-divergence in logistic regression): The binary logistic regression model with Y ∈ {0, 1} also admits simple bounds on its KL-divergence. For these, we first make the simple observation that for the log-sum-exp function f(t) = log(1 + e^t), we have f′(t) = e^t/(1 + e^t) and f″(t) = f′(t)(1 − f′(t)). Taylor's theorem states that

f(t + ∆) = f(t) + f′(t)∆ + ∫₀^∆ f″(t + u)(∆ − u) du,

and as 0 ≤ f″ ≤ ¼, we have |∫₀^∆ f″(t + u)(∆ − u) du| ≤ ¼ ∫₀^∆ |∆ − u| du = ∆²/8, so

|f(t + ∆) − f(t) − f′(t)∆| ≤ ∆²/8
for all ∆ ∈ R. Computing the KL-divergence directly, we thus have for any parameters θ0, θ1 that

Dkl(Pθ0(· | x)||Pθ1(· | x)) = f(θ1⊤x) − f(θ0⊤x) − f′(θ0⊤x) x⊤(θ1 − θ0) ≤ (x⊤(θ0 − θ1))²/8,

so the divergence is at most quadratic. 3
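The quadratic bound in Example 3.4.8 is easy to verify numerically; the sketch below (ours, purely illustrative) compares the exact Bernoulli KL divergence at logits t0 = θ0⊤x and t1 = θ1⊤x against (t0 − t1)²/8 for a few random logit pairs.

```python
import math, random

def kl_logistic(t0, t1):
    """D_kl(P_{t0} || P_{t1}) for Bernoulli variables with logits t0, t1."""
    p0, p1 = 1 / (1 + math.exp(-t0)), 1 / (1 + math.exp(-t1))
    return p0 * math.log(p0 / p1) + (1 - p0) * math.log((1 - p0) / (1 - p1))

random.seed(1)
for _ in range(5):
    t0, t1 = random.uniform(-3, 3), random.uniform(-3, 3)
    kl, bound = kl_logistic(t0, t1), (t0 - t1) ** 2 / 8
    assert kl <= bound + 1e-12
    print(f"t0={t0:+.2f} t1={t1:+.2f}  KL={kl:.4f} <= (t0-t1)^2/8 = {bound:.4f}")
```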
Consider testing the hypotheses

H0 : θ = θ0 versus H1,n : θ = θn

as n grows, where we observe a sample X1^n drawn i.i.d. either according to Pθ0 (i.e., H0) or Pθn (i.e., H1,n). By choosing θn in a way that makes the separation v⊤(θn − θ0) large but testing H0 against H1,n challenging, we can then (roughly) identify the separation δ at which testing becomes impossible.
Proposition 3.5.1. Let θ0 ∈ R^d. Then there exists a sequence of parameters θn with ∥θn − θ0∥ = O(1/√n), separation

v⊤(θn − θ0) = (1/√n) √(v⊤∇²A(θ0)⁻¹v),

and for which

inf_Ψ {Pθ0(Ψ(X1^n) ≠ 0) + Pθn(Ψ(X1^n) ≠ 1)} ≥ ½ + O(n^{−1/2}).
Proof   Let ∆ ∈ R^d be a potential perturbation to θ1 = θ0 + ∆, which gives separation δ = v⊤θ1 − v⊤θ0 = v⊤∆. Let P0 = Pθ0 and P1 = Pθ1. Then the smallest summed probability of error in testing between P0 and P1 based on n observations X1^n is

inf_Ψ {P0^n(Ψ(X1^n) ≠ 0) + P1^n(Ψ(X1^n) ≠ 1)} = 1 − ∥P0^n − P1^n∥TV

by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker's inequality (2.2.10) and use that the KL-divergence tensorizes to find

2∥P0^n − P1^n∥²TV ≤ nDkl(P0||P1) = nDkl(Pθ0||Pθ0+∆) = nDA(θ0 + ∆, θ0),

where the final equality follows from the equivalence between KL and Bregman divergences for exponential families (Proposition 3.3.1).
To guarantee that the summed probability of error is at least ½, that is, ∥P0^n − P1^n∥TV ≤ ½, it suffices to choose ∆ satisfying nDA(θ0 + ∆, θ0) ≤ ½. So to maximize the separation v⊤∆ while guaranteeing a constant probability of error, we (approximately) solve

maximize v⊤∆ subject to DA(θ0 + ∆, θ0) ≤ 1/(2n).

Now, consider that DA(θ0 + ∆, θ0) = ½∆⊤∇²A(θ0)∆ + O(∥∆∥³). Ignoring the higher order term, we consider maximizing v⊤∆ subject to ∆⊤∇²A(θ0)∆ ≤ 1/n. A Lagrangian calculation shows that this has solution

∆ = (1/√n) · (1/√(v⊤∇²A(θ0)⁻¹v)) · ∇²A(θ0)⁻¹v.

With this choice, we have separation δ = v⊤∆ = √(v⊤∇²A(θ0)⁻¹v/n), and DA(θ0 + ∆, θ0) = 1/(2n) + O(1/n^{3/2}). The summed probability of error is therefore at least

1 − ∥P0^n − P1^n∥TV ≥ 1 − √(nDA(θ0 + ∆, θ0)/2) = 1 − √(¼ + O(n^{−1/2})) = ½ + O(n^{−1/2}),

as desired.
Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Section 3.2.1. For an unknown parameter θ in the exponential family model Pθ, we observe X1, . . . , Xn, and wish to test whether v⊤θ ≥ t for a given threshold t. Call our null H0 : v⊤θ ≤ t, and assume we wish to test at an asymptotic level α > 0, meaning the probability the test falsely rejects H0 is (as n → ∞) at most α. Assuming the heuristic (3.2.4), we have the approximate distributional equality

v⊤θ̂n ∼· N(v⊤θ, (1/n) v⊤∇²A(θ̂n)⁻¹v).
Note that we have θ̂n on the right side of the distribution; it is possible to make this rigorous, but here we target only intuition building. A natural asymptotically level α test is then

Tn := Reject if v⊤θ̂n ≥ t + z_{1−α}√(v⊤∇²A(θ̂n)⁻¹v/n), and Accept otherwise,
where z_{1−α} is the 1 − α quantile of a standard normal, P(Z ≥ z_{1−α}) = α for Z ∼ N(0, 1). Let θ0 be such that v⊤θ0 = t, so H0 holds. Then

Pθ0(Tn rejects) = Pθ0( √n · v⊤(θ̂n − θ0) ≥ z_{1−α}√(v⊤∇²A(θ̂n)⁻¹v) ) → α.

At least heuristically, then, this separation δ = √(v⊤∇²A(θ0)⁻¹v)/√n is the fundamental separation in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence Dkl(Pθ0+∆||Pθ0) = ½∆⊤∇²A(θ0)∆ + O(∥∆∥³) near θ0 plays the fundamental role in both the lower bound and upper
bound on testing. When the Hessian ∇2 A(θ0 ) is “large,” meaning it is very positive definite,
distributions with small parameter distances are still well-separated in KL-divergence, making
testing easy, while when ∇2 A(θ0 ) is small (nearly indefinite), the KL-divergence can be small even
for large parameter separations ∆ and testing is hard. As a consequence, at least for exponential
family models, the Fisher information (3.3.2), which we defined as ∇2 A(θ) = Covθ (ϕ(X)), plays a
central role in testing and, as we see later, estimation.
Lemma 3.6.1. Consider any collection {θ1, . . . , θm} ⊂ Θ, and let Θ0 = Conv{θi}_{i=1}^m and C ⊂ int Θ0. Then for any k ∈ N, there exists a constant K = K(C, k, {θi}) such that for all θ0 ∈ C,

∥x∥^k exp(⟨θ0, x⟩) ≤ K max_{1≤j≤m} exp(⟨θj, x⟩) for all x ∈ R^d.

Proof   Let B = {u ∈ R^d | ∥u∥ ≤ 1} be the unit ball in R^d. For any ϵ > 0, there exists a K = K(ϵ) such that ∥x∥^k ≤ Ke^{ϵ∥x∥} for all x ∈ R^d. As C ⊂ int Θ0, there exists an ϵ > 0 such that for all θ0 ∈ C, θ0 + 2ϵB ⊂ Θ0, and by construction, for any u ∈ B we can write θ0 + 2ϵu = Σ_{j=1}^m λjθj for some λ ∈ R+^m with 1⊤λ = 1. We therefore have

∥x∥^k exp(⟨θ0, x⟩) ≤ K exp(ϵ∥x∥) exp(⟨θ0, x⟩) = K exp(⟨θ0 + 2ϵu, x⟩) for u = x/(2∥x∥).

But using the convexity of t ↦ exp(t) and that θ0 + 2ϵu ∈ Θ0, the last quantity has upper bound

K exp( Σ_{j=1}^m λj⟨θj, x⟩ ) ≤ K Σ_{j=1}^m λj exp(⟨θj, x⟩) ≤ K max_{1≤j≤m} exp(⟨θj, x⟩).
Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C,

|exp(⟨θ, x⟩) − exp(⟨θ0, x⟩)| ≤ K∥θ − θ0∥ ∥x∥ max_{1≤j≤m} exp(⟨θj, x⟩) for all x ∈ R^d.
Proof   We write

(exp(⟨θ, x⟩) − exp(⟨θ0, x⟩))/∥θ − θ0∥ = exp(⟨θ0, x⟩) · (exp(⟨θ − θ0, x⟩) − 1)/∥θ − θ0∥.
1
For complex functions, Osgood’s lemma shows that if A is continuous and holomorphic in each variable individ-
ually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g. [101, Ch. 1].
From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that by convexity e^{−a} ≥ 1 − a for all a ∈ R, so 1 − e^a ≤ |a| when a ≤ 0. Conversely, if a > 0, then ae^a ≥ e^a − 1 (note that d/da(ae^a) = ae^a + e^a ≥ e^a). Thus |e^a − 1| ≤ |a| max{1, e^a}, and taking a = ⟨θ, x⟩ (so |a| ≤ ∥θ∥∥x∥) and undoing the shift, we see that

|exp(⟨θ, x⟩) − exp(⟨θ0, x⟩)|/∥θ − θ0∥ ≤ ∥x∥ max{exp(⟨θ0, x⟩), exp(⟨θ, x⟩)} ≤ K max_{1≤j≤m} exp(⟨θj, x⟩),

where the final inequality is Lemma 3.6.1 with k = 1, as desired.
With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed, fix θ0 ∈ int Θ and for θ ∈ Θ, define

g(θ, x) = (exp(⟨θ, x⟩) − exp(⟨θ0, x⟩) − exp(⟨θ0, x⟩)⟨x, θ − θ0⟩)/∥θ − θ0∥ = (e^{⟨θ,x⟩} − e^{⟨θ0,x⟩} − ⟨∇e^{⟨θ0,x⟩}, θ − θ0⟩)/∥θ − θ0∥.
Then limθ→θ0 g(θ, x) = 0 by the differentiability of t ↦ e^t. Lemmas 3.6.1 and 3.6.2 show that if we take any collection {θj}_{j=1}^m ⊂ Θ for which θ0 ∈ int Conv{θj}, then for C ⊂ int Conv{θj}, there exists a constant K such that

|g(θ, x)| ≤ |exp(⟨θ, x⟩) − exp(⟨θ0, x⟩)|/∥θ − θ0∥ + ∥x∥ exp(⟨θ0, x⟩) ≤ K max_j exp(⟨θj, x⟩)

for all θ ∈ C. As ∫ max_j e^{⟨θj,x⟩} dµ(x) ≤ Σ_{j=1}^m ∫ e^{⟨θj,x⟩} dµ(x) < ∞, the dominated convergence theorem thus implies that

lim_{θ→θ0} ∫ g(θ, x) dµ(x) = 0,

that is, M(θ) = ∫ e^{⟨θ,x⟩} dµ(x) is differentiable at θ0 with ∇M(θ0) = ∫ x e^{⟨θ0,x⟩} dµ(x).
Analyticity   Over the subset ΘC := {θ + iz | θ ∈ Θ, z ∈ R^d} (where i = √−1 is the imaginary unit), we can extend the preceding results to demonstrate that A is analytic on ΘC. Indeed, we first simply note that for a, b ∈ R, exp(a + ib) = exp(a)exp(ib) and |exp(a + ib)| = exp(a), i.e. |e^z| = e^{Re(z)} for z ∈ C, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case. These are enough for the application of the dominated convergence theorem above, and we use that exp(·) is analytic to conclude that θ ↦ M(θ) is analytic on ΘC.
3.7 Bibliography
3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic ϕ and an explicit formula for the log
partition function A(θ, Θ) so that we can write pθ,Θ (x) = exp(⟨θ, ϕ1 (x)⟩ + ⟨Θ, ϕ2 (x)⟩ − A(θ, Θ)).
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let ℓ(θ; x, y) = −log pθ(y | x) be the associated log loss.

(ii) Let (xi, yi)_{i=1}^n ⊂ R^d × {±1} be a sample. Give a sufficient condition for the minimizer of the empirical log loss

Ln(θ) := (1/n)Σ_{i=1}^n ℓ(θ; xi, yi)

to be unique that depends only on the vectors {xi}. Hint. A C² convex function h is strictly convex if its Hessian ∇²h is positive definite.
Exercise 3.3: Give the Fisher information (3.4.2) for each of the following generalized linear
models:
Part I
Chapter 4
Concentration Inequalities
In many scenarios, it is useful to understand how a random variable X behaves by giving bounds on the probability that it deviates far from its mean or median. This allows us to prove that estimation and learning procedures achieve certain performance guarantees, that different decoding and encoding schemes work with high probability, and other results besides. In this chapter, we give several tools for proving bounds on the probability that random variables are far from their typical values. We conclude the chapter with a discussion of basic uniform laws of large numbers and applications to empirical risk minimization and statistical learning, though we focus on the relatively simple cases we can treat with our tools.
4.1 Basic tail inequalities

The most basic question is the following: given a random variable X, how can we bound the tail probability P(X ≥ t)? We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev, and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.
Proposition 4.1.1 (Markov's inequality). Let X be a nonnegative random variable, meaning that X ≥ 0 with probability 1. Then

P(X ≥ t) ≤ E[X]/t.

Proof   For any random variable, P(X ≥ t) = E[1{X ≥ t}] ≤ E[(X/t)1{X ≥ t}] ≤ E[X]/t, as X/t ≥ 1 whenever X ≥ t.
When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.
Proposition 4.1.2 (Chebyshev's inequality). Let X be a random variable with Var(X) < ∞. Then

P(X − E[X] ≥ t) ≤ Var(X)/t² and P(X − E[X] ≤ −t) ≤ Var(X)/t²

for all t ≥ 0.
Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).
Proposition 4.1.3 (Chernoff bounds). Let X be a random variable with moment generating function φX(λ) := E[e^{λX}]. Then

P(X ≥ t) ≤ E[e^{λX}]/e^{λt} = φX(λ)e^{−λt}

for all λ ≥ 0.
Proof This is another application of Markov’s inequality: for λ > 0, we have eλX ≥ eλt if and
only if X ≥ t, so that P(X ≥ t) = P(eλX ≥ eλt ) ≤ E[eλX ]/eλt .
In particular, taking the infimum over all λ ≥ 0 in Proposition 4.1.3 gives the more standard Chernoff (large deviation) bound

P(X ≥ t) ≤ exp( inf_{λ≥0} {log φX(λ) − λt} ).
4.1.1 Sub-Gaussian random variables

We say that a random variable X is σ²-sub-Gaussian if its moment generating function satisfies E[exp(λ(X − E[X]))] ≤ exp(λ²σ²/2) for all λ ∈ R. The canonical example is the Gaussian itself.

Example 4.1.4 (Gaussian random variables): If X ∼ N(0, σ²), then

φX(λ) = E[exp(λX)] = exp(λ²σ²/2).   (4.1.1)
As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3), we see that for X Gaussian with variance σ², we have

P(X ≥ E[X] + t) ≤ exp(−t²/(2σ²)) and P(X ≤ E[X] − t) ≤ exp(−t²/(2σ²))

for all t ≥ 0. Indeed, we have log φ_{X−E[X]}(λ) = λ²σ²/2, and inf_λ {λ²σ²/2 − λt} = −t²/(2σ²), which is attained by λ = t/σ². 3
Example 4.1.5 (Random signs (Rademacher variables)): The random variable X taking values in {−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have

E[exp(λX)] = ½e^λ + ½e^{−λ} = ½Σ_{k=0}^∞ λ^k/k! + ½Σ_{k=0}^∞ (−λ)^k/k! = Σ_{k=0}^∞ λ^{2k}/(2k)! ≤ Σ_{k=0}^∞ (λ²)^k/(2^k k!) = exp(λ²/2),

as claimed. 3
Bounded random variables are also sub-Gaussian; indeed, we have the following example.

Example 4.1.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b]. Then Hoeffding's lemma states that

E[e^{λ(X−E[X])}] ≤ exp( λ²(b − a)²/8 ),

so that X is (b − a)²/4-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives one approach to proving the above statement. First, let ε ∈ {−1, 1} be a Rademacher variable, so that P(ε = 1) = P(ε = −1) = ½. We apply a so-called symmetrization technique—a common technique in probability theory, statistics, concentration inequalities, and Banach space research—to give a simpler bound. Indeed, let X′ be an independent copy of X, so that E[X′] = E[X]. We have

E[exp(λ(X − E[X]))] = E[exp(λ(X − E[X′]))] ≤ E[exp(λ(X − X′))] = E[exp(λε(X − X′))],
where the inequality follows from Jensen's inequality (applied to the expectation over X′) and the last equality is a consequence of the fact that X − X′ is symmetric about 0. Using the result of Example 4.1.5 conditionally on X − X′,

E[exp(λε(X − X′))] ≤ E[exp( λ²(X − X′)²/2 )] ≤ exp( λ²(b − a)²/2 ),

where the final inequality is immediate from the fact that |X − X′| ≤ b − a. 3
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior, more sophisticated techniques involve explicitly bounding the logarithm of the moment generating function of X, often by calculations involving exponential tilts of its density. In particular, letting X be mean zero for simplicity, if we let

ψ(λ) := log E[e^{λX}],

then

ψ′(λ) = E[Xe^{λX}]/E[e^{λX}] and ψ″(λ) = E[X²e^{λX}]/E[e^{λX}] − E[Xe^{λX}]²/E[e^{λX}]²,

where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite. Notably, if X has density pX (with respect to any base measure) then the random variable Yλ with density

pλ(y) = pX(y) e^{λy}/E[e^{λX}]

(with respect to the same base measure) satisfies E[Yλ] = ψ′(λ) and Var(Yλ) = ψ″(λ).
One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let X be Bernoulli(p), so that X = 1 with probability p and X = 0 otherwise. Then a strengthening of Hoeffding's lemma (also, essentially, due to Hoeffding) is that

log E[e^{λ(X−p)}] ≤ σ²(p)λ²/2 for σ²(p) := (1 − 2p)/(2 log((1 − p)/p)).

Here we take limits as p → {0, ½, 1}, so that σ²(0) = 0, σ²(1) = 0, and σ²(½) = ¼. Because p ↦ σ²(p) is concave and symmetric about p = ½, this inequality is always sharper than that of Example 4.1.6. Exercise 4.12 gives one proof of this bound exploiting exponential tilting. 3
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let X be σ²-sub-Gaussian. Then for all t ≥ 0 we have

P(X − E[X] ≥ t) ∨ P(X − E[X] ≤ −t) ≤ exp(−t²/(2σ²)).
Chernoff bounds extend naturally to sums of independent random variables, because moment generating functions of sums of independent random variables become products of moment generating functions.

Proposition 4.1.9. Let X1, . . . , Xn be independent, where Xi is σi²-sub-Gaussian. Then Σ_{i=1}^n Xi is Σ_{i=1}^n σi²-sub-Gaussian.

Proof   We assume w.l.o.g. that the Xi are mean zero. We have by independence and sub-Gaussianity that

E[exp(λΣ_{i=1}^n Xi)] = E[exp(λΣ_{i=1}^{n−1} Xi)] E[exp(λXn)] ≤ exp(λ²σn²/2) E[exp(λΣ_{i=1}^{n−1} Xi)],

and applying this inequality recursively yields E[exp(λΣ_{i=1}^n Xi)] ≤ exp(λ²Σ_{i=1}^n σi²/2).
Two immediate corollaries to Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random variables concentrate around their expectations. We begin with a general concentration inequality.

Corollary 4.1.10. Let Xi be independent σi²-sub-Gaussian random variables. Then for all t ≥ 0,

max{ P(Σ_{i=1}^n (Xi − E[Xi]) ≥ t), P(Σ_{i=1}^n (Xi − E[Xi]) ≤ −t) } ≤ exp( −t²/(2Σ_{i=1}^n σi²) ).
Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with Corollary 4.1.10: if Xi ∈ [ai, bi], then

P(Σ_{i=1}^n (Xi − E[Xi]) ≥ t) ≤ exp( −2t²/Σ_{i=1}^n (bi − ai)² ).
To give another interpretation of these inequalities, let us assume that the Xi are independent and σ²-sub-Gaussian. Then we have that

P( (1/n)Σ_{i=1}^n (Xi − E[Xi]) ≥ t ) ≤ exp(−nt²/(2σ²)),

or, for δ ∈ (0, 1), setting exp(−nt²/(2σ²)) = δ, that is, t = √(2σ² log(1/δ))/√n, we have that

(1/n)Σ_{i=1}^n (Xi − E[Xi]) ≤ √(2σ² log(1/δ))/√n with probability at least 1 − δ.
There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.
Theorem 4.1.11. Let X be a random variable and σ² ≥ 0. The following statements are all equivalent, meaning that there are numerical constant factors Kj such that if one statement (i) holds with parameter Ki, then statement (j) holds with parameter Kj ≤ CKi, where C is a numerical constant.

(1) Sub-gaussian tails: P(|X| ≥ t) ≤ 2exp(−t²/(K1σ²)) for all t ≥ 0.

(2) Sub-gaussian moments: E[|X|^k]^{1/k} ≤ K2σ√k for all k ≥ 1.

(3) Super-exponential moment: E[exp(X²/(K3σ²))] ≤ 2.

(4) Sub-gaussian moment generating function (assuming additionally E[X] = 0): E[exp(λX)] ≤ exp(K4λ²σ²) for all λ ∈ R.

Particularly, (1) implies (2) with K1 = 1 and K2 ≤ e^{1/e}; (2) implies (3) with K2 = 1 and K3 = e√(2/(e − 1)) < 3; (3) implies (1) with K3 = 1 and K1 = 1/log 2. For the last part, (3) implies (4) with K3 = 1 and K4 ≤ ¾, while (4) implies (1) with K4 = ½ and K1 ≤ 2.
This result is standard in the literature on concentration and random variables; see Section 4.4.1
for a proof.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the ex-
ample, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6).
Example 4.1.12 (Squares of sub-Gaussian random variables): Let X be a mean-zero σ²-sub-Gaussian random variable and Z ∼ N(0, 1) be independent of X. Then for 0 ≤ λ < 1/(2σ²),

E[exp(λX²)] = E[exp(√(2λ)XZ)] ≤(i) E[exp(λσ²Z²)] =(ii) 1/[1 − 2σ²λ]_+^{1/2},

where the first equality uses E_Z[e^{sZ}] = e^{s²/2} with s = √(2λ)X, inequality (i) follows because X is sub-Gaussian (taking the expectation over X conditional on Z), and (ii) because Z ∼ N(0, 1). 3
where inequality (⋆) holds for λ ≤ ¼, because −log(1 − 2λ) ≤ 2λ + 4λ² for λ ≤ ¼. 3
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that X is a
mean zero random variable taking values in [−b, b] with variance σ 2 = E[X 2 ] (note that we are
guaranteed that σ 2 ≤ b2 in this case). We claim that
2 2
3λ σ 1
E[exp(λX)] ≤ exp for |λ| ≤ . (4.1.4)
5 2b
To see this, we expand ez via
∞ ∞
z 2 X 2z k−2 z2 X 2
ez = 1 + z + =1+z+ zk .
2 k! 2 (k + 2)!
k=2 k=0
P∞ P∞
For k ≥ 0, we have 2
(k+2)! ≤ 1
3k
, so that 2
k=0 (k+2)! z
k ≤ k=0 |z/3|
k = [1 − |z|/3]−1
+ . Thus
1 z2
ez ≤ 1 + z + ,
[1 − |z|/3]+ 2
3
and as |X| ≤ b and |λ| < we therefore obtain
b
λ2 X 2 λ2 σ 2
1
E[exp(λX)] ≤ 1 + E[λX] + E ≤1+ .
2 [1 − |λX|/3]+ 1 − |λ|b/3 2
73
Lexture Notes on Statistics and Information Theory John Duchi
1 1
Letting |λ| ≤ 2b implies 1−|λ|b/3 ≤ 56 , and using that 1 + x ≤ ex gives the result.
It is possible to give a slightly tighter result for λ ≥ 0 In this case, we have the bound
∞
λ2 σ 2 X λk−2 bk−2 σ2
E[exp(λX)] ≤ 1 + + λ2 σ 2 = 1 + 2 eλb − 1 − λb .
2 k! b
k=3
Then using that 1 + x ≤ ex ,
we obtain Bennett’s moment generating inequality, which is that
2
λX σ λb
E[e ] ≤ exp e − 1 − λb for λ ≥ 0. (4.1.5)
b2
λ2 b2
Inequality (4.1.5) always holds, and for λb near 0, we have eλb − 1 − λb ≈ 2 . 3
In particular, if the variance σ 2 ≪ b2 , the absolute bound on X, inequality (4.1.4) gives much
tighter control on the moment generating function of X than typical sub-Gaussian bounds based
only on the fact that X ∈ [−b, b] allow.
More broadly, we can show a result similar to Theorem 4.1.11.
Theorem 4.1.15. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 4.1.11—
the following statements are all equivalent for suitable numerical constants K1 , . . . , K4 .
(1) Sub-exponential tails: P(|X| ≥ t) ≤ 2 exp(− Kt1 σ ) for all t ≥ 0
Comparing with sub-Gaussian random variables, which have b = 0, we see that Proposition 4.1.16
gives a similar result for small t—essentially the same concentration sub-Gaussian random variables—
while for large t, the tails decrease only exponentially in t.
We can also give a tensorization identity similar to Proposition 4.1.9.
74
Lexture Notes on Statistics and Information Theory John Duchi
Proof We apply an inductive technique similar to that used in the proof of Proposition 4.1.9.
1
First, for any fixed i, we know that if |λ| ≤ bi |ai|
, then |ai λ| ≤ b1i and so
λ2 a2i σi2
E[exp(λai Xi )] ≤ exp .
2
1
Now, we inductively apply the preceding inequality, which applies so long as |λ| ≤ bi |ai | for all i.
We have
n n n
" X # Y 2 2 2
Y λ ai σi
E exp λ ai Xi = E[exp(λai Xi )] ≤ exp ,
2
i=1 i=1 i=1
It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound
is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for σ 2 -sub-Gaussian random
variables) that
n
!
t2
X
P ai Xi ≥ t ≤ exp − 2 2 ,
i=1
2 ∥a∥ 2 σ
so that for small t, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large t, the bound is
weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian
bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b]
with probability 1, but have variance σ 2 = E[Xi2 ] ≤ b2 as in Example 4.1.14. Then Corollary 4.1.18
implies that
n
( )!
5 t2
X
1 t
P ai Xi ≥ t ≤ exp − min , . (4.1.6)
2 6 σ 2 ∥a∥22 2b ∥a∥∞
i=1
75
Lexture Notes on Statistics and Information Theory John Duchi
When applied to a standard mean (and with a minor simplification that 5/12 < 1/3) with ai = n1 ,
t2
we obtain the bound that n1 ni=1 Xi ≤ t with probability at least 1−exp(−n min{ 3σ t
P
2 , 4b }). Written
q
3 log 1δ 4b log 1δ
differently, we take t = max{σ n , n } to obtain
q
1X
n 3 log 1δ 4b log 1
δ
Xi ≤ max σ √ , with probability 1 − δ.
n n n
i=1
q √
The sharpest such bound possible via more naive Hoeffding-type bounds is b 2 log 1δ / n, which
has substantially worse scaling.
The exercises ask you to work out further variants of these results, including the sub-exponential
behavior of quadratic forms of Gaussian random vectors. As one particular example, Exercises 4.10
and 4.11 work through the details of proving the following corollary.
Corollary 4.1.19. Let Z ∼ N(0, 1). Then for any µ ∈ R, (µ+Z)2 is (4(1+2µ2 ), 4)-sub-exponential,
and more precisely,
2 2
λ2
2 2
2λ µ
E exp λ (µ + Z) − (µ + 1) ≤ exp + .
1 − 2λ [1 − 2|λ|]+
Additionally, if Z ∼ N(0, I), then for any matrix A and vector b, ∥AZ − b∥22 is sub-exponential with
h i 1
E exp λ ∥AZ − b∥22 − ∥A∥2Fr − ∥b∥22 ≤ exp 2λ2 (∥A∥2Fr + 2 ∥b∥22 ) for |λ| ≤ .
4 ∥A∥2op
76
Lexture Notes on Statistics and Information Theory John Duchi
where inequality (i) used the Bernstein condition (4.1.7). Noting that 1+x ≤ ex gives the result.
As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.
Proof We assume without loss of generality that E[X] = 0. Using the standard Chernoff bound
argument coupled with inequality (4.1.5), we see that
n n
! !
X X X σi2 λb
P Xi ≥ t ≤ exp e − 1 − λb − λt .
b2
i=1 i=1
Letting h(t) = (1 + t) log(1 + t) − t as in the statement of the proposition and σ 2 = ni=1 σi2 , we
P
minimize over λ ≥ 0, setting λ = 1b log(1 + σbt2 ). Substituting into our Chernoff bound application
gives the proposition.
A slightly more intuitive writing of Bennett’s inequality is to use averages, in which case for
1 Pn
σ2 = n i=1 σi2 the average of the variances,
n
!
nσ 2
1X bt
P Xi ≥ t ≤ exp − h .
n b σ2
i=1
That this is a norm is not completely trivial, though a few properties are immediate: clearly
∥aX∥ψ = |a| ∥X∥ψ , and we have ∥X∥ψ = 0 if and only if X = 0 with probability 1. The key result
is that in fact, ∥·∥ψ is actually convex, which then guarantees that it is a norm.
77
Lexture Notes on Statistics and Information Theory John Duchi
Proposition 4.1.22. The function ∥·∥ψ is convex on the space of random variables.
Proof Because ψ is convex and non-decreasing, x 7→ ψ(|x|) is convex as well. (Convince yourself
of this.) Thus, its perspective transform pers(ψ)(t, |x|) := tψ(|x|/t) is jointly convex in both t ≥ 0
and x (see Appendix B.3.3). This joint convexity of pers(ψ) implies that for any random variables
X0 and X1 and t0 , t1 ,
E[pers(ψ)(λt0 + (1 − λ)t1 , |λX0 + (1 − λ)X1 |)] ≤ λE[pers(ψ)(t0 , |X0 |)] + (1 − λ)E[pers(ψ)(t1 , |X1 |)].
that is, the triangle inequality holds. This implies that centering a variable can never increase its
norm by much:
∥X − E[X]∥ψ ≤ ∥X∥ψ + ∥E[X]∥ψ ≤ ∥X∥ψ + ∥X∥ψ
by Jensen’s inequality, so that ∥X − E[X]∥ψ ≤ 2 ∥X∥ψ .
We can recover several standard norms on random variables, including some we have already
implicitly used. The first are the classical Lp norms, where we take ψ(t) = tp , where we see that
We also have what we term the sub-Gaussian and sub-Exponential norms, which we denote by
considering the functions
ψp (x) := exp (|x|p ) − 1.
These induce the Orlicz ψp -norms, as for p ≥ 1, these are convex (as they are the composition of the
increasing convex function exp(·) applied to the nonnegative convex function | · |p ). Theorem 4.1.11
shows that we have a sub-Gaussian norm
Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example,
the definition of the ψp -norms immediately implies that a sub-Gaussian random variable (whether
or not it is mean zero) has a sub-exponential square:
∥X∥2ψ2 = X 2 ψ1
.
Additionally,
∥XY ∥ψ1 ≤ ∥X∥ψ2 ∥Y ∥ψ2 .
78
Lexture Notes on Statistics and Information Theory John Duchi
2 2
Proof We prove only the second statement. Because xy ≤ x2η + ηy2 for any x, y, and any η > 0,
for any t > 0 we have
2
ηY 2
X
E[exp(|XY |/t)] ≤ E exp + ≤ E[exp(X 2 /ηt)]1/2 E[exp(ηY 2 /t)]1/2
2ηt 2t
by the Cauchy-Schwarz inequality. In particular, if we take t = ∥X∥ψ2 ∥Y ∥ψ2 , then the choice η =
∥X∥2ψ2 / ∥Y ∥2ψ2 gives E[exp(X 2 /ηt)] ≤ 2 and E[exp(ηY 2 /t)] ≤ 2, so that E[exp(|XY |/t)] ≤ 2.
By tracing through the arguments in the proofs of Theorems 4.1.11 and 4.1.15, we can also see that
we have the equivalences
1 1
∥X∥ψ2 ≍ sup √ E[|X|k ]1/k and ∥X∥ψ1 ≍ sup E[|X|k ]1/k ,
k∈N k k∈N k
Corollary 4.1.24. Let X be any random variable with ∥X∥ψ1 < ∞. Then for all t ≥ 0,
P(|X| ≥ t) ≤ 2 exp −t/ ∥X∥ψ1
Proof The first statement is nearly trivial: we have by the Chernoff bounding method that
h i
P(|X| ≥ t) ≤ E exp |X|/ ∥X∥ψ1 exp(−t/ ∥X∥ψ1 ) ≤ 2 exp(−t/ ∥X∥ψ1 )
by definition
R ∞ of the ψ1 -norm. For the second, we mimic the proof of Theorem 4.1.15: because
E[Z] = 0 P(Z ≥ t)dt for Z ≥ 0, we have
∞ ∞ ∞
E[|X|k ]
Z Z Z
≤ P(|X|/ ∥X∥ψ1 ≥ t 1/k
)dt = k P(|X|/ ∥X∥ψ1 ≥ u)u k−1
du ≤ 2k uk−1 e−u du
∥X∥kψ1 0 0 0
using the substitution uk = t. Rearranging yields E[|X|k ] ≤ 2 ∥X∥kψ1 Γ(k + 1) = 2 ∥X∥kψ1 k!. Then
computing the moment generating function, we obtain
∞ ∞
h i X λk E[|X|k ] X 2λ2
E exp(λX/ ∥X∥ψ1 ) ≤ 1 + k
≤ 1 + 2 λk = 1 +
∥X∥ψ1 k! 1 − |λ|
k=2 k=2
for |λ| < 1. For |λ| ≤ 12 , we use 1 + x ≤ ex to obtain E[exp(λX/ ∥X∥ψ1 )] ≤ exp(4λ2 ), which is the
desired result.
79
Lexture Notes on Statistics and Information Theory John Duchi
Depending on the norm chosen, this task may be impossible; for the Euclidean (ℓ2 ) norm, however,
such an embedding is easy to construct using Gaussian random variables and with m = O( ϵ12 log n).
This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size m is
independent of the dimension d, only depending on the number of points n.
and let Φi ∈ Rd denote the ith row of this matrix. We claim that
8 1
m ≥ 2 2 log n + log implies ∥Φui − Φuj ∥22 ∈ (1 ± ϵ) ∥ui − uj ∥22
ϵ δ
log n
for all pairs ui , uj with probability at least 1 − δ. In particular, m ≳ ϵ2
is sufficient to achieve
accurate dimension reduction with high probability.
To see this, note that for any fixed vector u,
m
⟨Φi , u⟩ ∥Φu∥22 X
∼ N(0, 1/m), and = ⟨Φi , u/ ∥u∥2 ⟩2
∥u∥2 ∥u∥22 i=1
is a sum of independent scaled χ2 -random variables. In particular, we have E[∥Φu/ ∥u∥2 ∥22 ] = 1,
and using the χ2 -concentration result of Example 4.1.13 yields
P ∥Φu∥22 / ∥u∥22 − 1 ≥ ϵ = P m ∥Φu∥22 / ∥u∥22 − 1 ≥ mϵ
mϵ2
2
≤ 2 inf exp 2mλ − λmϵ = 2 exp − ,
|λ|≤ 41 8
80
Lexture Notes on Statistics and Information Theory John Duchi
the last inequality holding for ϵ ∈ [0, 1]. Now, using the union bound applied to each of the
pairs (ui , uj ) in the sample, we have
mϵ2
n
P there exist i ̸= j s.t. ∥Φ(ui − uj )∥22 − ∥ui − uj ∥22 ≥ ϵ ∥ui − uj ∥22 ≤ 2 exp − .
2 8
2
Taking m ≥ ϵ82 log nδ = 16ϵ2
log n + ϵ82 log 1δ yields that with probability at least 1 − δ, we have
∥Φui − Φuj ∥22 ∈ (1 ± ϵ) ∥ui − uj ∥22 . 3
Computing low-dimensional embeddings of high-dimensional data is an area of active research,
and more recent work has shown how to achieve sharper constants [63] and how to use more struc-
tured matrices to allow substantially faster computation of the embeddings Φu (see, for example,
Achlioptas [2] for early work in this direction, and Ailon and Chazelle [5] for the so-called “Fast
Johnson-Lindenstrauss transform”).
the maximum likelihood decoder. We now investigate how to choose a collection {x1 , . . . , xm }
of such codewords and give finite sample bounds on its probability of error. In fact, by using
concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension
is likely to enjoy good performance.
Intuitively, if our codebook {x1 , . . . , xm } ⊂ {0, 1}d is well-separated, meaning that each pair of
words xi , xk satisfies ∥xi − xk ∥1 ≥ cd for some numerical constant c > 0, we should be unlikely to
make a mistake. Let us make this precise. We mistake word i for word k only if the received signal
Z satisfies ∥Z − xi ∥1 ≥ ∥Z − xk ∥1 , and letting J = {j ∈ [d] : xij ̸= xkj } denote the set of at least
c · d indices where xi and xk differ, we have
X
∥Z − xi ∥1 ≥ ∥Z − xk ∥1 if and only if |Zj − xij | − |Zj − xkj | ≥ 0.
j∈J
If xi is the word being sent and xi and xk differ in position j, then |Zj − xij | − |Zj − xkj | ∈ {−1, 1},
and is equal to −1 with probability (1 − ϵ) and 1 with probability ϵ. That is, we have ∥Z − xi ∥1 ≥
∥Z − xk ∥1 if and only if
X
|Zj − xij | − |Zj − xkj | + |J|(1 − 2ϵ) ≥ |J|(1 − 2ϵ) ≥ cd(1 − 2ϵ),
j∈J
81
Lexture Notes on Statistics and Information Theory John Duchi
and the expectation EQ [|Zj − xij | − |Zj − xkj | | xi ] = −(1 − 2ϵ) when xij ̸= xkj . Using the Hoeffding
bound, then, we have
where we have used that there are at least |J| ≥ cd indices differing between xi and xk . The
probability of making a mistake at all is thus at most m exp(− 12 cd(1 − 2ϵ)2 ) if our codebook has
separation c · d.
For low error decoding to occur with extremely high probability, it is thus sufficient to choose
a set of code words {x1 , . . . , xm } that is well separated. To that end, we state a simple lemma.
m2
d m
exp −2dt2 ≤ exp −2dt2 .
P ∃ i, j s.t. ∥Xi − Xj ∥1 < − dt ≤
2 2 2
Proof First,
n let us consider
o two independent draws X and X ′ uniformly on the hypercube. Let
Pd
Z = j=1 1 Xj ̸= Xj′ = dham (X, X ′ ) = ∥X − X ′ ∥1 . Then E[Z] = d2 . Moreover, Z is an i.i.d.
1
sum of Bernoulli 2 random variables, so that by our concentration bounds of Corollary 4.1.10, we
have
2t2
′ d
P X − X 1 ≤ − t ≤ exp − .
2 d
Using a union bound gives the remainder of the result.
(ii) By taking m ≤ exp(d/32), or d ≥ 32 log m, and δ = e−d/32 , then with probability at least
1 − e−d/32 —exponentially large in d—a randomly drawn codebook has all its entries separated
by at least ∥xi − xj ∥1 ≥ d4 .
8 log m
δ
d ≥ max 32 log m, .
(1 − 2ϵ)2
Then with probability at least 1 − 1/m over the draw of the codebook, the probability we make a
mistake in transmission of any given symbol i over the channel Q is at most δ.
82
Lexture Notes on Statistics and Information Theory John Duchi
for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric
random walk.
Example 4.2.1: Let Dn ∈ {±1} be uniform and independent. Then Dn form a martingale
difference sequence adapted to themselves (that is, we may take Zn = Dn ), and Mn = ni=1 Di
P
is a martingale. 3
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
Example 4.2.2 (Doob martingales): Let f : X n → R be an otherwise arbitrary function,
and let X1 , . . . , Xn be arbitrary random variables. The Doob martingale is defined by the
difference sequence
Di := E[f (X1n ) | X1i ] − E[f (X1n ) | X1i−1 ].
By inspection, the Di are functions of X1i , and we have
83
Lexture Notes on Statistics and Information Theory John Duchi
by the tower property of expectations. Thus, the Di satisfy Definition 4.4 of a martingale
difference sequence, and moreover, we have
n
X
Di = f (X1n ) − E[f (X1n )],
i=1
and so the Doob martingale captures exactly the difference between f and its expectation. 3
Proof The proof is essentially immediate: letting Zn be the sequence to which the Dn are
adapted, we write
" n #
Y
λDi
E[exp(λMn )] = E e
i=1
" " n ##
Y
=E E eλDi | Z1n−1
i=1
" "n−1 # #
Y
λDi n−1 λDn n−1
=E E e | Z1 E[e | Z1 ]
i=1
because D1 , . . . , Dn−1 are functions of Z1n−1 . Then we use Definition 4.5, which implies that
2 2
E[eλDn | Z1n−1 ] ≤ eλ σn /2 , and we obtain
"n−1 #
λ2 σn2
Y
λDi
E[exp(λMn )] ≤ E e exp .
2
i=1
84
Lexture Notes on Statistics and Information Theory John Duchi
as desired.
The second claims are simply applications of Chernoff bounds via Proposition 4.1.8 and that
E[Mn ] = 0.
Corollary 4.2.4. LetPDi be a bounded difference martingale difference sequence, meaning that
|Di | ≤ c. Then Mn = ni=1 Di satisfies
t2
−1/2 −1/2
P(n Mn ≥ t) ∨ P(n Mn ≤ −t) ≤ exp − 2 for t ≥ 0.
2c
√
Thus, bounded random walks are (with high probability) within ± n of their expectations after
n steps.
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [96].
Definition 4.6 (Bounded differences). Let f : X n → R for some space X . Then f satisfies
bounded differences with constants ci if for each i ∈ {1, . . . , n}, all xn1 ∈ X n , and x′i ∈ X we have
i−1 ′
|f (xi−1 n n
1 , xi , xi+1 ) − f (x1 , xi , xi+1 )| ≤ ci .
The classical inequality relating bounded differences and concentration is McDiarmid’s inequal-
ity, or the bounded differences inequality.
2t2
n n n n
P (f (X1 ) − E[f (X1 )] ≥ t) ∨ P (f (X1 ) − E[f (X1 )] ≤ −t) ≤ exp − Pn 2 .
i=1 ci
Proof The basic idea is to show that the Doob martingale (Example 4.2.2) associated with f is
c2i /4-sub-Gaussian, and then to simply apply the Azuma-HoeffdingPn inequality. nTo that end, define
n i n i−1 n
Di = E[f (X1 ) | X1 ] − E[f (X1 ) | X1 ] as before, and note that i=1 Di = f (X1 ) − E[f (X1 )]. The
random variables
85
Lexture Notes on Statistics and Information Theory John Duchi
where we have used the independence of the Xi and Definition 4.6 of bounded differences. Conse-
quently, we have by Hoeffding’s Lemma (Example 4.1.6) that E[eλDi | X1i−1 ] ≤ exp(λ2 c2i /8), that
is, the Doob martingale is c2i /4-sub-Gaussian.
The remainder of the proof is simply Theorem 4.2.3.
A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we will revisit them more later.
Example 4.2.6 (Bounded random vectors): Let B be a Banach space—a complete normed
vector space—with norm ∥·∥. Let Xi be independent bounded random vectors in B satisfying
E[Xi ] = 0 and ∥Xi ∥ ≤ c. We claim that the quantity
n
1X
f (X1n ) := Xi
n
i=1
i−1 ′ n 1 2c
|f (xi−1 n
1 , x, xi+1 ) − f (x1 , x , xi+1 )| ≤ x − x′ ≤ .
n n
Consequently, if Xi are indpendent, we have
n n
!
nt2
1X 1X
P Xi − E Xi ≥ t ≤ 2 exp − 2 (4.2.1)
n n 2c
i=1 i=1
for all t ≥ 0. That is, the norm of (bounded) random vectors in an essentially arbitrary vector
space concentrates extremely quickly about its expectation.
The challenge becomes to control the expectation term in the concentration bound (4.2.1),
which can be a bit challenging. In certain cases—for example, when we have a Euclidean
structure on the vectors Xi —it can be easier. Indeed, let us specialize to the case that Xi ∈ H,
a (real) Hilbert space, so that there is an inner product ⟨·, ·⟩ and the norm satisfies ∥x∥2 = ⟨x, x⟩
for x ∈ H. Then Cauchy-Schwarz implies that
Xn 2 X n 2 X Xn
E Xi ≤E Xi = E[⟨Xi , Xj ⟩] = E[∥Xi ∥2 ].
i=1 i=1 i,j i=1
That is assuming the Xi are independent and E[∥Xi ∥2 ] ≤ σ 2 , inequality (4.2.1) becomes
nt2
σ σ
P X n ≥ √ + t + P X n ≤ − √ − t ≤ 2 exp − 2
n n 2c
where X n = n1 ni=1 Xi . 3
P
86
Lexture Notes on Statistics and Information Theory John Duchi
We can specialize Example 4.2.6 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of Ex-
ample 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let
X be some space, and let F be some collection of functions f : X → R. Let εi ∈ {−1, 1} be a
collection of independent random sign vectors. Then the empirical Rademacher complexity of
F is
n
" #
1 X
Rn (F | xn1 ) := E sup εi f (xi ) ,
n f ∈F i=1
where the expectation is over only the random signs
P εi . (In some cases, depending on context
and convenience, one takes the absolute value | i εi f (xi )|.) The Rademacher complexity of
F is
Rn (F) := E[Rn (F | X1n )],
the expectation of the empirical Rademacher complexities.
If f : X → [b0 , b1 ] for all f ∈ F, then the Rademacher complexity satisfies bounded
differences, because for any two sequences xn1 and z1n differing in only element j, we have
n
X
n n
n|Rn (F | x1 )−Rn (F | z1 )| ≤ E sup εi (f (xi )−f (zi )) = E[sup εi (f (xj )−f (zj ))] ≤ b1 −b0 .
f ∈F i=1 f ∈F
(b1 −b0 )2
Consequently, the empirical Rademacher complexity satisfies Rn (F | X1n ) − Rn (F) is 4n -
sub-Gaussian by Theorem 4.2.3. 3
These examples warrant more discussion, and it is possible to argue that many variants of these
random variables are well-concentrated. For example, instead of functions we may simply consider
an arbitrary set A ⊂ Rn and define the random variable
n
X
Z(A) := sup⟨a, ε⟩ = sup ai εi .
a∈A a∈A i=1
As a function of the random signs εi , we may write Z(A) = f (ε), and this is then a function
satisfying |f (ε) − f (ε′ )| ≤ supa∈A |⟨a, ε − ε′ ⟩|, so that ′
Pn if ε and ε differ in index i, we have |f (ε) −
′
f (ε )| ≤ 2 supa∈A |ai |. That is, Z(A) − E[Z(A)] is i=1 supa∈A |ai |2 -sub-Gaussian.
as elements of this vector space L. (Here we have used 1Xi to denote the point mass at Xi .)
Then the Rademacher complexity is nothing more than the expected norm of Pn0 , a random
vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general
results we may prove about random vectors, as in Example 4.2.6, will carry over immediately
to versions of the Rademacher complexity. 3
87
Lexture Notes on Statistics and Information Theory John Duchi
where λmax and λmin denote maximal and minimal eigenvalues, respectively. Our approach will
be to generalize the approach using moment generating functions, though this becomes non-trivial
because there is no immediately obvious analogue of the tensorization identities we have for scalars.
While in the scalar case, for a sum Sn = ni=1 Xi of independent random variables, we have
P
n
Y
λSn
e = eλXi ,
i=1
such an identify fails for matrices, because their exponentials (typically) fail to commute.
To develop the basic matrix concentration equalities we provide, we require a brief review of
matrix calculus and operator functions. We shall typically work with Hermitian matrices A ∈ Cd×d ,
meaning that A = A∗ , where A∗ denotes the Hermitian transpose of A, whose entries are (A∗ )ij =
Aji , the conjugate of Aji . We work in this generality for two reasons: first, because such matrices
admit the spectral decompositions we require to develop the operators we use, and second, because
dist
we often will encounter random matrices with symmetric distributions, meaning that X = −X,
which can lead to confusion.
With this, we give a brief review of some properties of Hermitian matrices and some associated
matrix operators. Let Hd := {A ∈ Cd×d | A∗ = A} be the Hermitian matrices. The spectral
theorem states gives that any A ∈ Hd admits the spectral decomposition A = U ΛU ∗ , where Λ is
the diagonal matrix of the (necessarily) real eigenvalues of A and U ∈ Cn×n is unitary, so that
U ∗ U = U U ∗ = I. For a function f : R → R, we can then define its operator extension to Hd by
where A has spectral decomposition A = U ΛU ∗ and λi (A) denotes the ith eigenvalue of A. Because
we wish to mimic the approach based on moment generating functions that yields our original sub-
Gaussian and sub-Exponential concentration inequalities in Chapter 4, the most important function
for us will be the exponential, which evidently satisfies
∞
X 1 k
exp(A) = A ,
k!
k=0
88
Lexture Notes on Statistics and Information Theory John Duchi
A, B ∈ Cm×n we define ⟨A, B⟩ = tr(A∗ B), while the the space of Hermitian matrices admits the
real inner product ⟨A, B⟩ := tr(A ∗
Pd B). (See Exercise 4.14.) The spectral theorem also shows the
standard identity that tr(A) = j=1 λj (A) for A ∈ Hd .
To analogize our approach with real-valued random variables, we begin with the Chernoff bound,
Proposition 4.1.3. Here, we have the following observation:
Proof First, apply the standard Chernoff bound to the random variable λmax (X), which gives
that for any λ > 0 that
P(λmax (X) ≥ t) ≤ E[eλ·λmax (X) ]e−λt .
Then observe that by definition of the matrix exponential, we have eλ·λmax (X) ≤ tr(eλX ), because
the eigenvalues of eλX are all positive.
We would like now to provide some type of general tensorization identity for matrices, in analogy
with Propositions 4.1.9 or 4.1.17. Unfortunately, this breaks down: for Hermitian A, B, we have
eA+B = eA eB
if and only if A and B commute [153], so that they are simultaneously diagonalizable. Nonetheless,
we have the following inequality, which will be the key to extending the standard one-dimensional
approach to concentration:
tr(eA+B ) ≤ tr(eA eB ).
While the proof is essentially elementary, it is not central to our development, so we defer it to
Section 4.4.4. We remark in passing that there is a converse [153, Section 3]: tr(eA+B ) = tr(eA eB ) if
and only if AB = BA, that is, A and B are simultaneously diagonalizable. With Proposition 4.3.2 in
hand, however, we can develop matrix analogues of the Hoeffding and Bernstein-type concentration
bounds in Chapter 4.
We begin with Azuma-Hoeffding-type bounds, which analogize Theorem 4.2.3. The key to allow
an iterative “peeling” off of individual terms in a sum of random matrices is the following result:
Proof Let X ′ be an independent copy of X. Then because the trace exponential tr(eX ) is convex
on the Hermitian matrices (see Exercise 4.16), we have
′ ′
tr(E[eH+X ]) = tr(E[eH+(X−E[X ]) ]) ≤ tr(E[eH+X−X ])
89
Lexture Notes on Statistics and Information Theory John Duchi
dist
by Jensen’s inequality. Introducing the random sign ε ∈ {±1}, we have by symmetry that X−X ′ =
ε(X − X ′ ), and so
′ ′
tr(E[eH+X ]) ≤ E[tr(eH+εX−εX )] = E[tr(eH/2+εX+H/2−εX )]
′
≤ E[tr(eH/2+εX eH/2−εX )],
where the second inequality follows from Proposition 4.3.2. Now we use that for Hermitian matrices
A, B, we have tr(AB) ≤ ∥A∥2 ∥B∥2 = tr(A2 )1/2 tr(B 2 )1/2 , so that
′ ′ ′
E[tr(eH/2+εX eH/2−εX )] ≤ E[tr(eH+2εX )1/2 tr(eH−2εX )1/2 ] ≤ E[tr(eH+2εX )]1/2 E[tr(eH−2εX )]1/2
dist
by Cauchy-Schwarz. Because εX = −εX ′ , the lemma follows.
This allows us to perform the type of “peeling-off” argument, addressing one term in the sum
at a type, that gives tight enough moment generating function bounds.
where ε ∈ {±1} is an independent random sign and we have used independence. Now, we use the
following calculation: if X is Hermitian and ε a random sign, then
2 /2
E[eεX ] ⪯ E[eX ]. (4.3.1)
Temporarily deferring the argument for inequality (4.3.1), note that it immediately implies E[e2λεX ] ⪯
2 2
E[e2λ X ]. The convexity of the operator norm and that ∥Xn ∥op ≤ bn then imply
2λεXn 2λ2 Xn2 2λ2 ∥Xn2 ∥op 2 2
E[e ] ≤ E[e ] ≤E e ≤ e2λ bn
op op
Repeating the argument by iteratively peeling off the last term Xn−i for Sn−2 through S1 then
yields
Yn
E[tr(exp(λSn ))] ≤ tr(I) exp(2λ2 b2n ),
i=1
which gives the theorem.
To see inequality (4.3.1), note that for any positive semidefinite A, we have A ⪯ tA for t ≥ 1.
Then because X 2k ⪰ 0 for all k ∈ N and (2k)! ≥ 2k k!, we have
∞ ∞
εX
X E[X 2k ] X E[(X 2 )k ] 2 /2
E[e ]=I+ ⪯I+ = E[eX ],
(2k!) 2k k!
k=1 k=1
90
Lexture Notes on Statistics and Information Theory John Duchi
Theorem 4.3.4 immediately implies the following corollary, whose argument parallels those in
Chapter 4 (e.g., Corollary 4.1.10).
Corollary 4.3.5. Let Xi ∈ Hd be independent mean-zero Hermitian matrices. Then Sn := ni=1 Xi
P
satisfies
t2
P ∥Sn ∥op ≥ t ≤ 2d exp − Pn 2 .
8 i=1 bi
If we have more direct bounds on E[eλXi ], then we can also employ those via a similar “peeling
off” the last term argument. By carefully controlling matrix moment generating functions in a way
similar to that we did in Example 4.1.14 to obtain sub-exponential behavior for bounded random
variables, we can give a matrix Bernstein-type inequality.
Theorem 4.3.6. Let Xi be independent Hermitian matrices with ∥Xi ∥op ≤ b and E[Xi2 ] op
≤ σi2 .
Then Sn = ni=1 Xi satisfies
P
t2
3t
P ∥Sn ∥op ≥ t ≤ 2d exp − min , .
4 ni=1 σi2 4b
P
The proof of the theorem is similar to that of Theorem 4.3.4, so we leave it as an extended exercise
(Exercise 4.17).
We unpack the theorem a bit to give some intuition. Given a variance bound σ 2 such that
E[Xi2 ] ⪯ σ 2 I, the theorem states that
2
−1
nt 3nt
P n Sn op ≥ t ≤ 2d exp − min , .
4σ 2 4b
q
2σ
Letting δ ∈ (0, 1) be arbitrary and setting t = max{ √ n
log 2d 4b d
δ , 3n log δ }, we have
n
( r )
1X 2σ 2d 4b d
Xi ≤ max √ log , log
n n δ 3n δ
i=1 op
with probability at least 1 − δ. So we see the familiar sub-Gaussian and sub-exponential scaling of
the random sum.
where for the last inequality we made the substitution u = t2 /σ 2 . Noting that this final integral is
Γ(k/2), we have E[|X|k ] ≤ kσ k Γ(k/2). Because Γ(s) ≤ ss for s ≥ 1, we obtain
p √
E[|X|k ]1/k ≤ k 1/k σ k/2 ≤ e1/e σ k.
Thus (2) holds with K2 = e1/e .
91
Lexture Notes on Statistics and Information Theory John Duchi
1 k
(2) implies (3) Let σ = supk≥1 k − 2 E[|X|k ]1/k , so that K2 = 1 and E[|X|k ] ≤ k 2 σ for all k. For
K3 ∈ R+ , we thus have
∞ ∞ ∞
E[X 2k ] σ 2k (2k)k (i) X 2e k
X X
E[exp(X 2 /(K3 σ 2 ))] = ≤ ≤
k!K32k σ 2k k!K32k σ 2k K32
k=0 k=0 k=0
where inequality (i) follows because k! ≥ (k/e)k , or 1/k! ≤ (e/k)k . Noting that ∞ k 1
P
p k=0 α = 1−α ,
we obtain (3) by taking K3 = e 2/(e − 1) ≈ 2.933.
(3) implies (4) Let us take K3 = 1 and recall the assumption of (4) that E[X] = 0. We claim
that (4) holds with K4 = 34 . We prove this result for both small and large λ. First, note the
9x2
(non-standard but true!) inequality that ex ≤ x + e for all x. Then we have
16
2 2
9λ X
E[exp(λX)] ≤ E[λX] +E exp
| {z } 16
=0
4
Now note that for |λ| ≤ we have 9λ2 σ 2 /16 ≤ 1, and so by Jensen’s inequality,
3σ ,
2 2
9λ X 2
2 2
2 9λ16σ 9λ2 σ 2
E exp = E exp(X /σ ) ≤ e 16 .
16
λ2 cx2
For large λ, we use the simpler Fenchel-Young inequality, that is, that λx ≤ 2c + 2 , valid for all
c ≥ 0. Then we have for any 0 ≤ c ≤ 2 that
2
λ2 σ 2 cX λ2 σ 2 c
E[exp(λX)] ≤ e 2c E exp ≤ e 2c e 2 ,
2σ 2
4 1 9 2 2
where the final inequality follows from Jensen’s inequality. If |λ| ≥ 3σ , then 2 ≤ 32 λ σ , and we
have 2 2
1
[ 2c 9c 2 2
+ 32 ]λ σ 3λ σ
E[exp(λX)] ≤ inf e = exp .
c∈[0,2] 4
(3) implies (1) Assume (3) holds with K3 = 1. Then for t ≥ 0 we have
λt2
P(|X| ≥ t) = P(X 2 /σ 2 ≥ t2 /σ 2 ) ≤ E[exp(λX 2 /σ 2 )] exp − 2
σ
for all λ ≥ 0. For λ ≤ 1, Jensen’s inequality implies E[exp(λX 2 /σ 2 )] ≤ E[exp(X 2 /σ 2 )]λ ≤ eλ by
assumption (3). Set λ = log 2 ≈ .693.
1
(4) implies (1) This is the content of Proposition 4.1.8, with K4 = 2 and K1 = 2.
where we used the substitution u = t/σ. Thus we have E[|X|k ] ≤ 2Γ(k + 1)σ k , and using Γ(k + 1) ≤
k k yields E[|X|k ]1/k ≤ 21/k kσ, so that (2) holds with K2 ≤ 2.
92
Lexture Notes on Statistics and Information Theory John Duchi
where inequality (i) used that k! ≥ (k/e)k . Taking K3 = e2 /(e − 1) < 5 gives the result.
(2) if and only if (4) Assume that (2) holds with K2 = 1, and let σ = supk≥1 k1 E[|X|k ]1/k . Then
because E[X] = 0,
∞ ∞ ∞
X λk E[X k ] X (k|λ|σ)k X
E[exp(λX)] ≤ 1 + ≤1+ ≤1+ (e|λ|σ)k ,
k! k!
k=2 k=2 k=2
1
where we have used that k! ≥ (k/e)k . When |λ| < σe , evaluating the geometric series yields
(eλσ)2
E[exp(λX)] ≤ 1 + .
1 − e|λ|σ
1
For |λ| ≤ 2eσ , we obtain E[eλX ] ≤ 1 + 2e2 σ 2 λ2 , and as 1 + x ≤ ex this implies (4).
For the opposite direction, assume (4) holds with K4 = K4′ = 1. Then E[exp(λX/σ)] ≤ exp(1)
for λ ∈ [−1, 1], and (3) holds. The preceding parts imply the remainder of the equivalence.
We leave the proof of the equality (4.4.1) as Exercise 4.15. Using it, however, it is evidently sufficient
to prove that there exists some sequence of integers n → ∞ where along this sequence,
n
tr eA/n eB/n ≤ tr(eA eB ). (4.4.2)
93
Lexture Notes on Statistics and Information Theory John Duchi
Now recall that the Schatten p-norm of a matrix A is ∥A∥p := tr((AA∗ )p/2 )1/p = ∥γ(A)∥p ,
the
P ℓp -norm of its singular values, where p = 2 gives the Euclidean or Frobenius norm ∥A∥2 =
2 1/2
( i,j |Aij | ) . This norm gives a generalized Hölder-type inequality for powers of 2, that is,
n ∈ {2k }k∈N , which we can in turn use to prove the Golden-Thompson inequality. In particular,
we demonstrate that for n a power of 2,
To see this inequality, we proceed inductively. Because the trace defines the inner product
⟨A, B⟩ = tr(A∗ B), for n = 2, the Cauchy-Schwarz inequality implies
We now perform an induction, where we have demonstrated the base case n = 2. Then for n ≥ 4
a power of 2, we have by the inductive hypothesis that inequality (4.4.3) holds for n/2 that
Now consider an arbitrary pair of matrices A, B. We will demonstrate that ∥AB∥n/2 ≤ ∥A∥n ∥B∥n ,
which will then evidently imply inequality (4.4.3). For these, we have
n/2 ∗ ∗ ∗ ∗ ∗ ∗ n/4
∥AB∥n/2 = tr(ABB
| A ·
{z· · ABB A} ) = tr (A ABB )
n/4 times
by the cyclic property of the trace. Using the inductive hypothesis again with n/4 copies of each
of the matrices AT A and BB T , we thus have
n/2
∥AB∥n/2 ≤ tr (A∗ ABB ∗ )n/4
1/2 1/2
n/4 n/4
≤ ∥A∗ A∥n/2 ∥BB ∗ ∥n/2 = tr (A∗ A)n/2 tr (BB ∗ )n/2 = ∥A∥n/2 n/2
n ∥B∥n .
That is, we have ∥AB∥n/2 ≤ ∥A∥n ∥B∥n for any A, B as desired, giving inequality (4.4.3).
We apply inequality (4.4.3) to powers of products of Hermitian matrices A, B. We have
tr ((AB)n ) ≤ ∥AB∥nn = tr (ABB ∗ A∗ )n/2 = tr (A∗ ABB ∗ )n/2 = tr (A2 B 2 )n/2
because A = A∗ and B = B ∗ . Recognizing that A2 and B 2 are Hermitian, we repeat this argument
to obtain
tr (A2 B 2 )n/2 ≤ tr (A4 B 4 )n/4 ≤ · · · ≤ tr(An B n )
for any n ∈ {2k }k∈N . Replacing A and B by eA and eB , which are both symmetric, we obtain
94
Lexture Notes on Statistics and Information Theory John Duchi
4.5 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin’s extraordi-
narily readable lecture notes [187], upon which our proof of Theorem 4.1.11 is based, the compre-
hensive book of Boucheron, Lugosi, and Massart [37], and the more advanced material in Buldygin
and Kozachenko [43]. Many of our arguments are based off of those of Vershynin and Boucheron
et al. Kolmogorov and Tikhomirov [126] introduced metric entropy.
We give weaker versions of the matrix-Hoeffding and matrix-Bernstein inequalities. It is possible
to do much better.
Ahlswede-Winter developed the matrix concentration inequalities and Petz [153].
I took the proof of Golden-Thompson from Terry Tao’s blog.
Lemma 4.3.3 is [181, Lemma 7.6].
It is possible to obtain better concentration guarantees using Lieb’s concavity inequality that
is a concave function in A ≻ 0.
4.6 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let X be a random variable taking
values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding’s Lemma, that is,
that X is sub-Gaussian: for all λ ∈ R, we have
2
λ (b − a)2
E[exp(λ(X − E[X]))] ≤ exp .
8
(b−a)2
(a) Show that Var(X) ≤ ( b−a 2
2 ) = 4 for any random variable X taking values in [a, b].
(b) Let
ϕ(λ) = log E[exp(λ(X − E[X]))].
Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that
(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and
95
Lexture Notes on Statistics and Information Theory John Duchi
(b) Let X be a random variable with E[X] = 0 and Var(X) = σ 2 > 0. Show that
1
lim inf log E[exp(λX)] > 0.
|λ|→∞ |λ|
2
Exercise 4.3 (Mills ratio): Let ϕ(t) = √12π e−t /2 be the density of a standard Gaussian, Z ∼
Rt
N(0, 1), and Φ(t) = −∞ ϕ(u)du its cumulative distribution function.
(b) Define
t
g(t) := 1 − Φ(t) − ϕ(t).
t2 + 1
Show that g(0) = 0, g ′ (t) < 0 for all t ≥ 0, and that limt→∞ g(t) = 0.
Exercise 4.4 (Likelihood ratio bounds and concentration): Consider a data release problem,
where given a sample x, we release a sequence of data Z1 , Z2 , . . . , Zn belonging to a discrete set Z,
where Zi may depend on Z1i−1 and x. We assume that the data has limited information about x
in the sense that for any two samples x, x′ , we have the likelihood ratio bound
p(zi | x, z1i−1 )
≤ eε .
p(zi | x′ , z1i−1 )
Let us control the amount of “information” (in the form of an updated log-likelihood ratio) released
by this sequential mechanism. Fix x, x′ , and define
p(z1 , . . . , zn | x)
L(z1 , . . . , zn ) := log .
p(z1 , . . . , zn | x′ )
t2
ε
P (L(Z1 , . . . , Zn ) ≥ nε(e − 1) + t) ≤ exp − .
2nε2
(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any
test Ψ : Z n → {x, x′ }, we have
where Px and Px′ denote the sampling distribution of Z1n under x and x′ , respectively?
96
Lexture Notes on Statistics and Information Theory John Duchi
for all t.
Exercise 4.8 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Exercise 4.9 (Sub-gaussian maxima): Let X1 , . . . , Xn be σ 2 -sub-gaussian (not necessarily inde-
pendent) random variables. Show that
p
(a) E[maxi Xi ] ≤ 2σ 2 log n.
(b) There exists a numerical constant C < ∞ such that E[maxi |Xi |p ] ≤ (Cpσ 2 log k)p/2 .
(b) Use a direct integration argument, as in Examples 4.1.12 and 4.1.13, to show that
2λ2
2
2 1
E exp(λ(µ + Z) ) ≤ exp λµ + − log(1 − 2λ)
1 − 2λ 2
for λ < 21 . Use this to prove the first part of Corollary 4.1.19.
97
Lexture Notes on Statistics and Information Theory John Duchi
x2
Hint. It may be useful to use that − log(1 − x) ≤ −x + 2[1−|x|]+ for all x ∈ R.
Exercise 4.11: Let Z ∼ N(0, Id ), and let A ∈ Rn×d and b ∈ Rn be otherwise arbitrary. Using
the first part of Corollary 4.1.19, show the second part of Corollary 4.1.19, that is, that ∥AZ − b∥22
is (4(∥A∥2Fr + 2 ∥b∥22 ), 4 ∥A∥2op )-sub-exponential. Hint. Use the singular value decomposition A =
U ΓV ⊤ of A, and note that V ⊤ Z ∼ N(0, Id ). Then
Exercise 4.12 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will
derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [112, Thm. 1] or [125, 24]),
showing
1 − 2p 2
log E[et(X−p) ] ≤ t for all t ≥ 0. (4.6.1)
4 log 1−p
p
pet(1−p)
where Yt = (1 − p) with probability q(t) := pet(1−p) +(1−p)e−tp
and Yt = −p otherwise.
1 − 2p 2
f (s) = Cs + Cps − log(1 − p + peCs ),
2
so that inequality (4.6.1) holds if and only if f (s) ≥ 0 for all s ≥ 0. Give f ′ (s) and f ′′ (s).
(e) Show that f (0) = f (1) = f ′ (0) = f ′ (1) = 0, and argue that f ′′ (s) changes signs at most twice
and that f ′′ (0) = f ′′ (1) > 0. Use this to show that f (s) ≥ 0 for all s ≥ 0.
JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
1−2p
Exercise 4.13: Let s(p) = . Show that s is concave on [0, 1].
log 1−p
p
Exercise 4.14 (Inner products on complex matrices): Recall that ⟨·, ·⟩ is a complex inner product
on a vector space V if it satisfies the following for all x, y, z ∈ V :
(iii) It is conjugate linear in its first argument, so that ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for all α ∈ C.
98
Lexture Notes on Statistics and Information Theory John Duchi
The vector space V is real with real inner product if property (ii) is replaced with the symmetry
⟨x, y⟩ = ⟨y, x⟩ and linearity (iii) holds for α ∈ R.
(a) Show that the space of complex m × n matrices Cm×n has complex inner product ⟨A, B⟩ :=
tr(A∗ B).
(b) Show that the space Hn of n × n Hermitian matrices is a real vector space with inner product
⟨A, B⟩ := tr(A∗ B), and that consequently ⟨A, B⟩ ∈ R.
Exercise 4.15 (The Lie product formula): Let A and B be symmetric (or Hermitian) matrices.
Hint. One argument proceeds as follows. Let O(ϵ) denote a matrix E such that ∥E∥op ≲ ϵ.
First, demonstrate that
1
eA/n = I + A + O(n−2 ).
n
Then show that for any matrix A, we have (I + n−1 A + o(n−1 ))n → exp(A). Combine these.
(b) Give an example of matrices A and B that do not commute and for which exp(A + B) ̸=
exp(A) exp(B).
Exercise 4.16: Define the trace exponential function f (X) := tr(eX ) on the Hermitian matrices.
(a) Prove that f is monotone for the semidefinite order, that is, if X ⪯ Y , then f (X) ≤ f (Y ).
Hint. It is enough to show that for any A ⪰ 0, the one-dimensional function h(t) := f (X + tA)
is monotone in t, or even that h′ (0) ≥ 0.
(b) Prove that the trace exponential is convex on the Hermitian matrices, that is, f (X) := tr(eX )
is convex. Hint. It is enough to show that for any X, V Hermitian that h(t) := f (X + tV ) is
convex in t, for which it in turn suffices to show that h′′ (0) ≥ 0.
Exercise 4.17 (The matrix-Bernstein inequality): In this question, we prove Theorem 4.3.6.
(a) Let Xi ∈ Hd be independent Hermitian matrices and Sn = ni=1 Xi . Use the Golden-Thompson
P
inequality (Proposition 4.3.2) to show that for all λ ∈ R,
n
Y
tr(E[eλSn ]) ≤ d E[eλXi ] .
op
i=1
(b) Extend Example 4.1.14 to the matrix-valued case. Demonstrate that if X is a mean-zero
Hermitian random matrix with ∥X∥op ≤ b, then for all |λ| < 3b ,
1 λ2 E[X 2 ]
E[exp(λX)] ⪯ I + .
1 − b|λ|/3 2
99
Lexture Notes on Statistics and Information Theory John Duchi
3
for |λ| ≤ 2b .
Exercise 4.18: In this question, we use Lieb’s concavity inequality (4.5.1) to obtain a stronger
P X1 , . . . , Xn ∈ Hd be an independent sequence of d × d mean-zero
matrix Hoeffding inequality. Let
Hermitian matrices. Let Sn = ni=1 Xi be their sum.
(a) Let X be a random Hermitian matrix and X 2 ⪯ A2 . Show that
λ2 2
log E[eλεX | X] ⪯ A .
2
Hint. Use that the matrix logarithm is operator monotone, that is, if A ⪯ B, then log A ⪯ log B.
t2
P(λmax (Sn ) ≥ t) ≤ d exp − 2 .
8σ
(e) Give an example of random Hermitian matrices where the preceding bound is much sharper
than Corollary 4.3.5.
Exercise 4.19: In this question, we use Lieb’s concavity inequality (4.5.1) to demonstrate a
sharper matrix Bernstein inequality than Theorem 4.3.6.
(a) Define the matrix cumulant generating function ϕXi (λ) := log E[exp(λXi )]. Show that
n
!
X
E[tr(exp(λSn ))] ≤ tr exp ϕXi (λ) .
i=1
(b) Using Exercise 4.17 part (b), show that if E[Xi ] = 0, E[Xi2 ] ⪯ Σi , and ∥Xi ∥op ≤ b for each i,
then for |λ| < 3b , we have
n
λ2 2
λSn 1 X
E[tr(e )] ≤ d exp σ where σ 2 := Σi .
1 − b|λ|/3 2
i=1 op
100
Lexture Notes on Statistics and Information Theory John Duchi
(c) Show that there exists a numerical constant c > 0 such that for all t ≥ 0,
2
t t
P ∥Sn ∥op ≥ t ≤ 2d exp −c min , .
σ2 b
101
Chapter 5
denote the empirical distribution on {Xi }ni=1 , where 1Xi denotes the point mass at Xi . Then for
functions f : X → R (or more generally, any function f defined on X ), we let
n
1X
Pn f := EPn [f (X)] = f (Xi )
n
i=1
denote the empirical expectation of f evaluated on the sample, and we also let
Z
P f := EP [f (X)] = f (x)dP (x)
denote general expectations under a measure P . With this notation, we study uniform laws of
large numbers, which consist of proving results of the form
sup |Pn f − P f | → 0, (5.1.1)
f ∈F
102
Lexture Notes on Statistics and Information Theory John Duchi
where convergence is in probability, expectation, almost surely, or with rates of convergence. When
we view Pn and P as (infinite-dimensional) vectors on the space of maps from F → R, then we
may define the (semi)norm ∥·∥F for any L : F → R by
∥L∥F := sup |L(f )|,
f ∈F
by the triangle inequality. An entirely parallel argument gives the converse lower bound of − b−a
n ,
and thus Proposition 4.2.5 gives the result.
Proposition 5.1.1 shows that, to provide control over high-probability concentration of ∥Pn − P ∥F ,
it is (at least in cases where F is bounded) sufficient to control the expectation E[∥Pn − P ∥F ]. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized parts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetized quantity and the initial quantity.
Proposition 5.1.2. Let Xi be independent random vectors on a (Banach) space with norm ∥·∥
and let εi {−1, 1} be independent random signs. Then for any p ≥ 1,
" n # " n # " n #
X p X p X p
2−p E εi (Xi − E[Xi ]) ≤E (Xi − E[Xi ]) ≤ 2p E εi Xi
i=1 i=1 i=1
1
Some readers may worry about measurability issues here. All of our applications will be in separable spaces,
so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this
from now on.
103
Lexture Notes on Statistics and Information Theory John Duchi
In the proof of the upper bound, we could also show the bound
" n # " n #
X p X p
E (Xi − E[Xi ]) ≤ 2p E εi (Xi − E[Xi ]) ,
i=1 i=1
dist
Now, note that the distribution of Xi − Xi′ is symmetric, so that Xi − Xi′ = εi (Xi − Xi′ ), and thus
" n # " n #
X p X p
E (Xi − E[Xi ]) ≤E εi (Xi − Xi′ ) .
i=1 i=1
as desired.
For the left bound in the proposition, let Yi = Xi − E[Xi ] be the centered version of the random
variables. We break the sum over random variables into two parts, conditional on whether εi = ±1,
using repeated conditioning. We have
" n # " #
X p X X p
E εi Yi =E Yi − Yi
i=1 i:εi =1 i:ε=−1
" " # " ##
X p X p
≤ E 2p−1 E Yi | ε + 2p−1 E Yi |ε
i:εi =1 i:εi −1
" " # " ##
X X p X X p
p−1
=2 E E Yi + E[Yi ] |ε +E Yi + E[Yi ] |ε
i:εi =1 i:εi =−1 i:εi =−1 i:εi =1
" " # " ##
X X p X X p
≤ 2p−1 E E Yi + Yi |ε +E Yi + Yi |ε
i:εi =1 i:εi =−1 i:εi =−1 i:εi =1
n
" #
X p
= 2p E Yi .
i=1
104
Lexture Notes on Statistics and Information Theory John Duchi
The expectation of Pn0 F is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and
we have the following corollary.
Corollary 5.1.3. Let F be a class of functions f : X → R and Xi be i.i.d. Then E[∥Pn − P ∥F ] ≤
2E[∥Pn0 ∥F ].
From Corollary 5.1.3, it is evident that by controlling the expectation of the symmetrized process
E[∥Pn0 ∥F ] we can derive concentration inequalities and uniform laws of large numbers. For example,
we immediately obtain that
2nt2
0
P ∥Pn − P ∥F ≥ 2E[∥Pn ∥F ] + t ≤ exp −
(b − a)2
for all t ≥ 0 whenever F consists of functions f : X → [a, b].
There are numerous examples of uniform laws of large numbers, many of which reduce to
developing bounds on the expectation E[∥Pn0 ∥F ], which is frequently possible via more advanced
techniques we develop in Chapter 7. A frequent application of these symmetrization ideas is to
risk minimization problems, as we discuss in the coming section; for these, it will be useful for us
to develop a few analytic and calculus tools. To better match the development of these ideas, we
return to the notation of Rademacher complexities, so that Rn (F) := E[ Pn0 F ]. The first is a
standard result, which we state for its historical value and the simplicity of its proof.
Proposition 5.1.4 (Massart’s finite class bound). Let F be any collection of functions with f :
X → R, and assume that σn2 := n−1 E[maxf ∈F ni=1 f (Xi )2 ] < ∞. Then
P
p
2σn2 log |F|
Rn (F) ≤ √ .
n
Proof For each fixed xn1 , the random variable ni=1 εi f (xi ) is ni=1 f (xi )2 -sub-Gaussian. Now,
P P
σ 2 (xn1 ) := n−1 maxf ∈F ni=1 f (xi )2 . Using the results of Exercise 4.9, that is, that E[maxj≤n Zj ] ≤
P
define
p
2σ 2 log n if the Zj are each σ 2 -sub-Gaussian, we see that
p
n 2σ 2 (xn1 ) log |F|
Rn (F | x1 ) ≤ √ .
n
√ p
Jensen’s inequality that E[ ·] ≤ E[·] gives the result.
A refinement of Massart’s finite class bound applies when the classes are infinite but, on a
collection X1 , . . . , Xn , the functions f ∈ F may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points x1 , . . . , xn by SF (xn1 ) :=
card{(f (x1 ), . . . , f (xn )) | f ∈ F }, the number of distinct vectors of values (f (x1 ), . . . , f (xn )) the
functions f ∈ F may take. The shatter coefficient is the maximum of the empirical shatter coeffi-
cients over xn1 ∈ X n , that is, SF (n) := supxn1 SF (xn1 ). It is clear that SF (n) ≤ |F| always, but by
only counting distinct values, we have the following corollary.
105
Lexture Notes on Statistics and Information Theory John Duchi
Corollary 5.1.5 (A sharper variant of Massart’s finite class bound). Let F be any collection of
functions with f : X → R, and assume that σn2 := n−1 E[maxf ∈F ni=1 f (Xi )2 ] < ∞. Then
P
p
2σn2 log SF (n)
Rn (F) ≤ √ .
n
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher
sums, which allow us to consider compositions of function classes and maintain small complexity
measurers. We state the rule here; the proof is complex, so we defer it to Section 4.4.3
Proof The result is an almost immediate consequence of Theorem 5.1.6; we simply recenter our
functions. Indeed, we have
n n
" #
1 X 1 X
Rn (ϕ ◦ F | xn1 ) = E sup εi (ϕ(f (xi )) − ϕ(0)) + εi ϕ(0)
f ∈F n i=1 n
i=1
n n
" # " #
1X 1X
≤ E sup εi (ϕ(f (xi )) − ϕ(0)) + E εi ϕ(0)
f ∈F n
i=1
n
i=1
|ϕ(0)|
≤ 2LRn (F) + √ ,
n
106
Lexture Notes on Statistics and Information Theory John Duchi
the expectation E[∥Pn0 ∥F ], though the techniques we develop here will have broader use and can
sometimes directly guarantee concentration.
The basic object we wish to control is a measure of the size of the space on which we work.
To that end, we modify notation a bit to simply consider arbitrary vectors θ ∈ Θ, where Θ is a
non-empty set with an associated (semi)metric ρ. For many purposes in estimation (and in our
optimality results in the further parts of the book), a natural way to measure the size of the set is
via the number of balls of a fixed radius δ > 0 required to cover it.
Definition 5.1 (Covering number). Let Θ be a set with (semi)metric ρ. A δ-cover of the set Θ with
respect to ρ is a set {θ1 , . . . , θN } such that for any point θ ∈ Θ, there exists some v ∈ {1, . . . , N }
such that ρ(θ, θv ) ≤ δ. The δ-covering number of Θ is
The metric entropy of the set Θ is simply the logarithm of its covering number log N (δ, Θ, ρ).
We can define a related measure—more useful for constructing our lower bounds—of size that
relates to the number of disjoint balls of radius δ > 0 that can be placed into the set Θ.
Definition 5.2 (Packing number). A δ-packing of the set Θ with respect to ρ is a set {θ1 , . . . , θM }
such that for all distinct v, v ′ ∈ {1, . . . , M }, we have ρ(θv , θv′ ) ≥ δ. The δ-packing number of Θ is
Figures 5.1 and 5.2 give examples of (respectively) a covering and a packing of the same set.
An exercise in proof by contradiction shows that the packing and covering numbers of a set are
in fact closely related:
Lemma 5.1.8. The packing and covering numbers satisfy the following inequalities:
We leave derivation of this lemma to Exercise 5.2, noting that it shows that (up to constant factors)
packing and covering numbers have the same scaling in the radius δ. As a simple example, we see
107
Lexture Notes on Statistics and Information Theory John Duchi
δ/2
for any interval [a, b] on the real line that in the usual absolute distance metric, N (δ, [a, b], | · |) ≍
(b − a)/δ.
As one example of the metric entropy, consider a set of functions F with reasonable covering
numbers (metric entropy) in ∥·∥∞ -norm.
Example 5.1.9 (The “standard” covering number guarantee): Let F consist of functions
f : X → [−b, b] and let the metric ρ be ∥f − g∥∞ = supx∈X |f (x) − g(x)|. Then
!
nt2
P sup |Pn f − P f | ≥ t ≤ exp − + log N (t/3, F, ∥·∥∞ ) . (5.1.2)
f ∈F 18b2
So as long as the covering numbers N (t, F, ∥·∥∞ ) grow sub-exponentially in t—so that log N (t) ≪
nt2 —we have the (essentially) sub-Gaussian tail bound (5.1.2). Example 5.2.11 gives one typ-
ical case. Indeed, fix a minimal t/3-cover of F in ∥·∥∞ of size N := N (t/3, F, ∥·∥∞ ), call-
ing the covering functions f1 , . . . , fN . Then for any f ∈ F and the function fi satisfying
∥f − fi ∥∞ ≤ t/2, we have
2t
|Pn f − P f | ≤ |Pn f − Pn fi | + |Pn fi − P fi | + |P fi − P f | ≤ |Pn fi − P fi | + .
3
The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that
nt2
P max |Pn fi − P fi | ≥ t ≤ exp − 2 + log N .
i≤N 2b
Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.
108
Lexture Notes on Statistics and Information Theory John Duchi
Proof We prove the lemma via a volumetric argument. For the lower bound, note that if the
points v1 , . . . , vN are a δ-cover of B, then
N
X
Vol(B) ≤ Vol(δB + vi ) = N Vol(δB) = N Vol(B)δ d .
i=1
In particular, N ≥ δ −d . For the upper bound on N (δ, B, ∥·∥), let V be a δ-packing of B with
maximal cardinality, so that |V| = M (δ, B, ∥·∥) ≥ N (δ, B, ∥·∥) (recall Lemma 5.1.8). Notably, the
collection of δ-balls {δB + vi }M
i=1 cover the ball B (as otherwise, we could put an additional element
in the packing V), and moreover, the balls { 2δ B + vi } are all disjoint by definition of a packing.
Consequently, we find that
d
δ d
δ δ δ
M Vol(B) = M Vol B ≤ Vol B + B = 1 + Vol(B).
2 2 2 2
Rewriting, we obtain
d
δ d Vol(B) 2 d
2
M (δ, B, ∥·∥) ≤ 1+ = 1+ ,
δ 2 Vol(B) δ
for all u ∈ Rd . For example, X ∼ N(0, Id ) is immediately 1-sub-Gaussian, and X ∈ [−b, b]d with
independent entries is b2 -sub-Gaussian. Now, suppose that Xi are independent isotropic random
vectors, meaning that E[Xi ] = 0, E[Xi Xi⊤ ] = Id , and that they are also σ 2 -sub-Gaussian. Then by
an application of Lemma 5.1.10, we can give concentration guarantees for the sample covariance
Σn := n1 ni=1 Xi Xi⊤ for the operator norm ∥A∥op := sup{⟨u, Av⟩ | ∥u∥2 = ∥v∥2 = 1}.
P
Proposition 5.1.11. Let Xi be independent isotropic and σ 2 -sub-Gaussian vectors. Then there is
a numerical constant C such that the sample covariance Σn := n1 ni=1 Xi Xi⊤ satisfies
P
s
d + log 1δ d + log 1δ
∥Σn − Id ∥op ≤ Cσ 2 +
n n
109
Lexture Notes on Statistics and Information Theory John Duchi
Proof The second inequality is trivial. Fix any u ∈ Bd2 . Then for the i such that ∥u − ui ∥2 ≤ ϵ,
we have
⟨u, Au⟩ = ⟨u − ui , Au⟩ + ⟨ui , Au⟩ = 2⟨u − ui , Au⟩ + ⟨ui , Aui ⟩ ≤ 2ϵ ∥A∥op + ⟨ui , Aui ⟩
by definition of the operator norm. Taking a supremum over u gives the final result.
Let the matrix Ei = Xi Xi⊤ − I, and define the average error E n = n1 Ei . Then with this lemma
in hand, we see that for any ϵ-cover N of the ℓ2 -ball Bd2 ,
(1 − 2ϵ) E n op
≤ max⟨u, E n u⟩.
u∈N
Now, note that ⟨u, Ei u⟩ = ⟨u, Xi ⟩2 −∥u∥22 is sub-exponential, as it is certainly mean 0 and, moreover,
is the square of a sub-Gaussian; in particular, Theorem 4.1.15 shows that there is a numerical
constant C < ∞ such that
1
E[exp(λ⟨u, Ei u⟩)] ≤ exp Cλ2 σ 4
for |λ| ≤ .
Cσ 2
1
Taking ϵ = 4 in our covering N , then,
P( E n op ≥ t) ≤ P max⟨u, E n u⟩ ≥ t/2 ≤ |N | · max P ⟨u, nE n u⟩ ≥ nt/2
u∈N u∈N
110
Lexture Notes on Statistics and Information Theory John Duchi
In general, however, we only have access to the risk via the empirical distribution of the Zi , and
we often choose f by minimizing the empirical risk
n
b n (f ) := 1
X
L ℓ(f, Zi ). (5.2.2)
n
i=1
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat
more concrete.
Example 5.2.1 (Binary classification problems): One standard problem—still abstract—
that motivates the formulation (5.2.1) is the binary classification problem. Here the data Zi
come in pairs (X, Y ), where X ∈ X is some set of covariates (independent variables) and
Y ∈ {−1, 1} is the label of example X. The function class F consists of functions f : X → R,
and the goal is to find a function f such that
P(sign(f (X)) ̸= Y )
is small, that is, minimizing the risk E[ℓ(f, Z)] where the loss is the 0-1 loss, ℓ(f, (x, y)) =
1 {f (x)y ≤ 0}. 3
Example 5.2.3 (Binary classification with linear functions): In the standard statistical
learning setting, the data x belong to Rd , and we assume that our function class F is indexed
by a set Θ ⊂ Rd , so that F = {fθ : fθ (x) = θ⊤ x, θ ∈ Θ}. In this case, we may use the zero-one
loss,
the convex hinge loss, or the (convex) logistic loss, which are variously ℓzo (fθ , (x, y)) :=
1 yθ⊤ x ≤ 0 , and the convex losses
h i
ℓhinge (fθ , (x, y)) = 1 − yx⊤ θ and ℓlogit (fθ , (x, y)) = log(1 + exp(−yx⊤ θ)).
+
The hinge and logistic losses, as they are convex, are substantially computationally easier to
work with, and they are common choices in applications. 3
The main motivating question that we ask is the following: given a sample Z1 , . . . , Zn , if we
choose some fbn ∈ F based on this sample, can we guarantee that it generalizes to unseen data? In
particular, can we guarantee that (with high probability) we have the empirical risk bound
n
b n (fbn ) = 1
X
L ℓ(fbn , Zi ) ≤ R(fbn ) + ϵ (5.2.3)
n
i=1
for some small ϵ? If we allow f̂n to be arbitrary, then this is clearly impossible: consider the
classification example 5.2.1, and set f̂n to be the “hash” function that sets f̂n(x) = y if the pair
(x, y) was in the sample, and otherwise f̂n(x) = −1. Then clearly L̂n(f̂n) = 0, while there is no
useful bound on R(f̂n).
Such uniform bounds are certainly sufficient to guarantee that the empirical risk is a good proxy
for the true risk L, even when fbn is chosen based on the data.
Now, recalling that our set of functions or predictors F is finite or countable, let us suppose
that for each f ∈ F, we have a complexity measure c(f )—a penalty—such that
\[
  \sum_{f∈F} e^{−c(f)} \le 1. \tag{5.2.5}
\]
This inequality should remind the reader of the Kraft inequality from coding theory, which we will
see in the coming chapters. As soon as we have such a penalty function, we have the
following result.
Theorem 5.2.4. Let the loss ℓ, distribution P on Z, and function class F be such that ℓ(f, Z) is
σ 2 -sub-Gaussian for each f ∈ F, and assume that the complexity inequality (5.2.5) holds. Then
with probability at least 1 − δ over the sample Z1:n ,
\[
  L(f) \le \hat{L}_n(f) + \sqrt{2σ^2\,\frac{\log\frac{1}{δ} + c(f)}{n}} \quad \text{for all } f ∈ F.
\]
Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10)
we have for any t ≥ 0 and any f ∈ F that
\[
  P\big(L(f) \ge \hat{L}_n(f) + t\big) \le \exp\Big(−\frac{nt^2}{2σ^2}\Big).
\]
Now, if we replace t by \sqrt{t^2 + 2σ^2 c(f)/n}, we obtain
\[
  P\Big(L(f) \ge \hat{L}_n(f) + \sqrt{t^2 + 2σ^2 c(f)/n}\Big) \le \exp\Big(−\frac{nt^2}{2σ^2} − c(f)\Big).
\]
Then using a union bound, we have
\[
  P\Big(∃\, f ∈ F \text{ s.t. } L(f) \ge \hat{L}_n(f) + \sqrt{t^2 + 2σ^2 c(f)/n}\Big)
  \le \sum_{f∈F} \exp\Big(−\frac{nt^2}{2σ^2} − c(f)\Big)
  = \exp\Big(−\frac{nt^2}{2σ^2}\Big)\underbrace{\sum_{f∈F} \exp(−c(f))}_{\le 1}.
\]
As one classical example of this setting, suppose that we have a finite class of functions F. Then
we can set c(f ) = log |F|, in which case we clearly have the summation guarantee (5.2.5), and we
obtain
\[
  L(f) \le \hat{L}_n(f) + \sqrt{2σ^2\,\frac{\log\frac{1}{δ} + \log|F|}{n}} \quad \text{uniformly for } f ∈ F
\]
with probability at least 1 − δ. To make this even more concrete, consider the following example.
Example 5.2.5 (Floating point classifiers): We implement a linear binary classifier using
double-precision floating point values, that is, we have fθ (x) = θ⊤ x for all θ ∈ Rd that may
be represented using d double-precision floating point numbers. Then for each coordinate of
θ, there are at most 2^{64} representable numbers; in total, we must thus have |F| ≤ 2^{64d}. Thus,
for the zero-one loss ℓzo(fθ, (x, y)) = 1{θ⊤xy ≤ 0}, we have
\[
  L(f_θ) \le \hat{L}_n(f_θ) + \sqrt{\frac{\log\frac{1}{δ} + 45d}{2n}}
\]
for all representable classifiers simultaneously, with probability at least 1 − δ, as the zero-one
loss is 1/4-sub-Gaussian. (Here we have used that 64 log 2 < 45.) 3
We also note in passing that by replacing δ with δ/2 in the bounds of Theorem 5.2.4, a union
bound yields the following two-sided corollary.
Example 5.2.7 (Rademacher complexity of the ℓ2 -ball): Let Θ = {θ ∈ Rd | ∥θ∥2 ≤ r}, and
consider the class of linear functionals F := {fθ (x) = θT x, θ ∈ Θ}. Then
\[
  R_n(F \mid x_1^n) \le \frac{r}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
because we have
\[
  R_n(F \mid x_1^n) = \frac{r}{n} E\bigg[\bigg\|\sum_{i=1}^n ε_i x_i\bigg\|_2\bigg]
  \le \frac{r}{n}\sqrt{E\bigg[\bigg\|\sum_{i=1}^n ε_i x_i\bigg\|_2^2\bigg]}
  = \frac{r}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
as desired. 3
Example 5.2.8 (Rademacher complexity of the ℓ1 -ball): In contrast to the previous example,
suppose that Θ = {θ ∈ Rd | ∥θ∥1 ≤ r}, and consider the linear class F := {fθ (x) = θT x, θ ∈ Θ}.
Then
\[
  R_n(F \mid x_1^n) = \frac{r}{n} E\bigg[\bigg\|\sum_{i=1}^n ε_i x_i\bigg\|_∞\bigg].
\]
Now, each coordinate j of \sum_i ε_i x_i is \sum_i x_{ij}^2-sub-Gaussian, and thus using that E[max_{j≤d} Z_j] ≤
\sqrt{2σ^2\log d} for arbitrary σ²-sub-Gaussian Z_j (see Exercise 4.9), we have
\[
  R_n(F \mid x_1^n) \le \frac{r}{n}\sqrt{2\log(2d)\,\max_j \sum_{i=1}^n x_{ij}^2}. \qquad 3
\]
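Both bounds are easy to sanity-check by Monte Carlo. The sketch below (ours; the data and trial counts are arbitrary) estimates the empirical Rademacher complexities by sampling sign vectors εi, using that the supremum over each ball is r times the dual norm of Σᵢ εᵢxᵢ:

```python
# Monte Carlo check of Examples 5.2.7 and 5.2.8 (a sketch; names ours).
# sup over the l2 (resp. l1) ball of <theta, sum_i eps_i x_i> equals
# r times the l2 (resp. l_infinity) norm of sum_i eps_i x_i.
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 200, 50, 1.0
X = rng.standard_normal((n, d))

def rademacher_estimate(ball, trials=2000):
    vals = []
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=n)
        s = eps @ X
        dual = np.linalg.norm(s) if ball == "l2" else np.abs(s).max()
        vals.append(r * dual / n)
    return float(np.mean(vals))

print("l2 estimate:", rademacher_estimate("l2"),
      "  bound:", r / n * np.sqrt((X**2).sum()))
print("l1 estimate:", rademacher_estimate("l1"),
      "  bound:", r / n * np.sqrt(2 * np.log(2 * d) * (X**2).sum(axis=0).max()))
```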
These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L =
{(x, y) 7→ ℓ(yf (x))}f ∈F be the induced collection of losses. As a typical example, we might have
ℓ(t) = [1 − t]+ , ℓ(t) = e−t , or ℓ(t) = log(1 + e−t ). We have the following proposition.
Proposition 5.2.9. Let F and X be such that supx∈X |f (x)| ≤ M for f ∈ F and assume that
ℓ is L-Lipschitz. Define the empirical and population risks L̂n(f) := Pn ℓ(Y f(X)) and L(f) :=
P ℓ(Y f(X)). Then
\[
  P\Big(\sup_{f∈F} |\hat{L}_n(f) − L(f)| \ge 4LR_n(F) + t\Big) \le 2\exp\Big(−\frac{nt^2}{2L^2M^2}\Big) \quad \text{for } t \ge 0.
\]
Proof We may recenter the class L, that is, replace ℓ(·) with ℓ(·) − ℓ(0), without changing
L̂n(f) − L(f). Call this class L0, so that ∥Pn − P∥_L = ∥Pn − P∥_{L0}. This recentered class satisfies
bounded differences with constant 2M L, as |ℓ(yf (x)) − ℓ(y ′ f (x′ ))| ≤ L|yf (x) − y ′ f (x′ )| ≤ 2LM ,
as in the proof of Proposition 5.1.1. Applying Proposition 5.1.1 and then Corollary 5.1.3 gives
that P(sup_{f∈F} |L̂n(f) − L(f)| ≥ 2Rn(L0) + t) ≤ exp(−nt²/(2M²L²)) for t ≥ 0. Then applying the con-
traction inequality (Theorem 5.1.6) yields Rn(L0) ≤ 2LRn(F), giving the result.
Example 5.2.10 (Support vector machines and hinge losses): In the support vector machine
problem, we receive data (Xi, Yi) ∈ R^d × {±1}, and we seek to minimize the average of the losses
ℓ(θ; (x, y)) = [1 − yθ⊤x]_+. We assume that the space X has ∥x∥2 ≤ b for x ∈ X and that
Θ ⊂ {θ : ∥θ∥2 ≤ r}, so that Proposition 5.2.9 (with L = 1 and M = rb) implies
\[
  P\Big(\sup_{θ∈Θ} |P_n ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| \ge 4R_n(F_Θ) + t\Big) \le \exp\Big(−\frac{nt^2}{2r^2b^2}\Big),
\]
where FΘ = {fθ(x) = θ⊤x}θ∈Θ. Now, we apply Example 5.2.7, which implies that
\[
  R_n(ϕ ◦ F_Θ) \le 2R_n(F_Θ) \le \frac{2rb}{\sqrt{n}}.
\]
Substituting into the previous display, we obtain
\[
  P\Big(\sup_{θ∈Θ} |P_n ℓ(θ; (X, Y)) − P ℓ(θ; (X, Y))| \ge \frac{4rb}{\sqrt{n}} + t\Big) \le \exp\Big(−\frac{nt^2}{2(rb)^2}\Big),
\]
so that Pn and P become close at rate roughly rb/√n in this case. 3
Example 5.2.11 (Lipschitz functions over a norm-bounded parameter space): Consider the
parametric loss minimization problem of minimizing L(θ) := P ℓ(θ, Z) over θ ∈ Θ ⊂ R^d,
for a loss function ℓ that is M -Lipschitz (with respect to the norm ∥·∥) in its argument, where
for normalization we assume inf θ∈Θ ℓ(θ, z) = 0 for each z. Then the metric entropy of Θ
bounds the metric entropy of the loss class F := {z 7→ ℓ(θ, z)}θ∈Θ for the supremum norm
∥·∥∞. Indeed, for any pair θ, θ′, we have
\[
  \sup_z |ℓ(θ, z) − ℓ(θ′, z)| \le M\|θ − θ′\|.
\]
Assume that Θ ⊂ {θ | ∥θ∥ ≤ b} for some finite b. Then Lemma 5.1.10 guarantees that
log N(ϵ, Θ, ∥·∥) ≤ d log(1 + 2/ϵ) ≲ d log(1/ϵ), and so the classical covering number argument in
Example 5.1.9 gives
\[
  P\Big(\sup_{θ∈Θ} |P_n ℓ(θ, Z) − P ℓ(θ, Z)| \ge t\Big) \le \exp\Big(−c\,\frac{nt^2}{b^2M^2} + Cd\log\frac{M}{t}\Big),
\]
where c, C are numerical constants. In particular, taking t² ≍ (M²b²d/n) log(n/δ) gives that
\[
  |P_n ℓ(θ, Z) − P ℓ(θ, Z)| \lesssim \frac{Mb\sqrt{d\log\frac{n}{δ}}}{\sqrt{n}}
\]
with probability at least 1 − δ. 3
\[
  L^* = \inf_f L(f),
\]
where the preceding infimum is taken across all (measurable) functions.
More broadly, let {Fk }k∈N be a (possibly infinite) increasing sequence of function classes. We
assume that for each Fk and each n ∈ N, there exists a constant Cn,k (δ) such that we have the
uniform generalization guarantee
\[
  P\Big(\sup_{f∈F_k} |\hat{L}_n(f) − L(f)| \ge C_{n,k}(δ)\Big) \le δ \cdot 2^{−k}.
\]
(We will see in subsequent sections of the course how to obtain other more general guarantees.)
We consider the following structural risk minimization procedure. First, given the empirical
risk L̂n, we find the model collection k̂ minimizing the penalized risk
\[
  \hat{k} := \mathop{\mathrm{argmin}}_{k∈N} \inf_{f∈F_k}\Big\{ \hat{L}_n(f) + C_{n,k}(δ) \Big\}. \tag{5.2.7a}
\]
We then choose f̂ to minimize the risk over the estimated “best” class F_{k̂}, that is, set
\[
  \hat{f} := \mathop{\mathrm{argmin}}_{f∈F_{\hat{k}}} \hat{L}_n(f). \tag{5.2.7b}
\]
Theorem 5.2.12. Let f̂ be chosen according to the procedure (5.2.7a)–(5.2.7b). Then with proba-
bility at least 1 − δ, we have
\[
  L(\hat{f}) \le \inf_{k∈N}\inf_{f∈F_k}\big\{ L(f) + 2C_{n,k}(δ) \big\}.
\]
Proof  On the event that sup_{f∈Fk} |L̂n(f) − L(f)| < C_{n,k}(δ) for all k, which occurs with probability at least
1 − δ, we have
\[
  L(\hat{f}) \le \hat{L}_n(\hat{f}) + C_{n,\hat{k}}(δ) = \inf_{k∈N}\inf_{f∈F_k}\big\{ \hat{L}_n(f) + C_{n,k}(δ) \big\} \le \inf_{k∈N}\inf_{f∈F_k}\{ L(f) + 2C_{n,k}(δ) \},
\]
as claimed.
We conclude with a final example, using our earlier floating point bound from Example 5.2.5,
coupled with Corollary 5.2.6 and Theorem 5.2.12.
Example 5.2.13 (Structural risk minimization with floating point classifiers): Consider
again our floating point example, and let the function class Fk consist of functions defined by
at most k double-precision floating point values, so that log |Fk| ≤ 64k log 2 ≤ 45k. Then by taking
\[
  C_{n,k}(δ) = \sqrt{\frac{\log\frac{1}{δ} + 65k\log 2}{2n}}
\]
we have that |L̂n(f) − L(f)| ≤ C_{n,k}(δ) simultaneously for all f ∈ Fk and all k, with probability
at least 1 − δ. Then the structural risk minimization procedure (5.2.7) guarantees that
\[
  L(\hat{f}) \le \inf_{k∈N}\inf_{f∈F_k}\Bigg\{ L(f) + \sqrt{\frac{2\log\frac{1}{δ} + 91k}{n}} \Bigg\}.
\]
Roughly, we trade between small risk L(f)—as the risk inf_{f∈Fk} L(f) must be decreasing in
k—and the estimation error penalty, which scales as \sqrt{(k + \log\frac{1}{δ})/n}. 3
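A minimal Python sketch of the selection rule (5.2.7a)–(5.2.7b) may clarify the procedure; here we assume we are handed the minimal empirical risk within each class Fk (the dictionary interface and the toy risks are our own invention for illustration):

```python
# Sketch of structural risk minimization (5.2.7a)-(5.2.7b); interface ours.
import numpy as np

def srm_select(empirical_risks, n, delta):
    """empirical_risks[k] = min over f in F_k of the empirical risk."""
    best_k, best_val = None, np.inf
    for k, risk in empirical_risks.items():
        # C_{n,k}(delta) from Example 5.2.13, with log|F_k| <= 64 k log 2
        C_nk = np.sqrt((np.log(1 / delta) + 65 * k * np.log(2)) / (2 * n))
        if risk + C_nk < best_val:
            best_k, best_val = k, risk + C_nk
    return best_k, best_val

toy_risks = {k: 0.5 / k for k in range(1, 20)}   # risk decreasing in k
print(srm_select(toy_risks, n=10_000, delta=0.05))
```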
Example 5.3.2 (Logistic regression): In binary logistic regression (Example 3.4.2) with
labels y ∈ {±1}, we have data z = (x, y) ∈ Rd × {±1}, and the loss
ℓ(θ, z) = − log pθ(y | x) = log(1 + exp(−yx⊤θ)),
which is C∞, has domain all of R^d, and has Lipschitz-continuous derivatives of all orders (though
the particular Lipschitz constants depend on x). 3
Regardless, we will consider general rates of convergence and estimation error for estimators
minimizing the empirical loss
\[
  L_n(θ) := P_n ℓ(θ, Z) = \frac{1}{n}\sum_{i=1}^n ℓ(θ, Z_i),
\]
providing prototypical arguments for its convergence. Often, we will take Θ = Rd , though this will
not be essential.
Based on the results in the preceding sections, one natural idea is to
use uniform convergence: if we can argue that
\[
  \sup_{θ∈Θ} |L_n(θ) − L(θ)| \to 0,
\]
then so long as the minimizer θ⋆ of L is unique and L(θ) − L(θ⋆) grows with the distance ∥θ − θ⋆∥,
we necessarily have θbn → θ⋆ . Unfortunately, this naive approach typically fails to achieve the
correct convergence rates, let alone the correct dependence on problem parameters. We therefore
take another approach.
JCD Comment: Will need some figures here / illustrations
Recall that a twice differentiable function L is convex if and only if ∇2 L(θ) ⪰ 0 for all θ ∈ dom L.
We thus expect that L should have some quadratic growth around its minimizer θ⋆ , meaning that
in a neighborhood of θ⋆ , we have
\[
  L(θ) \ge L(θ^⋆) + \frac{λ}{2}\|θ − θ^⋆\|_2^2
\]
for θ near enough θ⋆ . In such a situation, because the sampled Ln is also convex and approximates
L, we then expect that for parameters θ far enough from θ⋆ that the growth of L(θ) above L(θ⋆ )
dominates the noise inherent in the sampling, we necessarily have Ln (θ) > Ln (θ⋆ ). Because the
empirical minimizer θbn necessarily satisfies Ln (θbn ) ≤ Ln (θ⋆ ), we would never choose such a distant
parameter, thus implying a convergence rate. To make this type of argument rigorous requires a
bit of convex analysis and sampling theory; luckily, we are by now well-equipped to address this.
Assumption A.5.1 (Standard conditions). For each z ∈ Z, the losses ℓ(θ, z) are convex in θ.
There are constants M0, M1, M2 < ∞ such that for each z ∈ Z,

(i) the gradient is bounded at the optimum, ∥∇ℓ(θ⋆, z)∥2 ≤ M0,

(ii) ∥∇²ℓ(θ⋆, z)∥op ≤ M1, and

(iii) the Hessian ∇²ℓ(θ, z) is M2-Lipschitz continuous in a neighborhood of radius r > 0 around
θ⋆, meaning ∥∇²ℓ(θ0, z) − ∇²ℓ(θ1, z)∥op ≤ M2∥θ0 − θ1∥2 whenever ∥θi − θ⋆∥2 ≤ r.

Additionally, the minimizer θ⋆ = argminθ L(θ) exists and for a λ > 0 satisfies
\[
  ∇^2 L(θ^⋆) ⪰ λI.
\]
The “standard conditions” in Assumption A.5.1 are not so onerous. As we see when we spe-
cialize our coming results to exponential family models in Section 5.3.4, Assumption A.5.1 holds
essentially as soon as the family is minimal and ϕ(x) is bounded. The existence of minimizers can
be somewhat more subtle to guarantee than the smoothness conditions (i)–(iii), though these are
typically straightforward. (For more on the existence of minimizers, see Exercise 5.10.)
To quickly highlight the conditions, we revisit binary logistic and robust regression.
Example (Example 5.3.2 continued): For logistic regression with labels y ∈ {±1}, we have
\[
  ∇ℓ(θ, (x, y)) = −\frac{1}{1 + e^{yx^⊤θ}}\,yx
  \quad \text{and} \quad
  ∇^2 ℓ(θ, (x, y)) = p_θ(y \mid x)(1 − p_θ(y \mid x))\,xx^⊤.
\]
Then Assumptions (i)–(iii) hold so long as supx∈X ∥x∥2 < ∞, with M0 = supx∈X ∥x∥2 and
M1 = 14 M02 . We revisit the existence of minimizers in the sequel, noting that because 0 <
pθ (y | x)(1 − pθ (y | x)) ≤ 14 for any θ, x, y, if a minimizer exists then it is necessarily unique
as soon as E[XX ⊤ ] ≻ 0. 3
Example (Example 5.3.3 continued): For robust regression, the loss is ℓ(θ, (x, y)) = h(⟨x, θ⟩ − y),
so that
\[
  ∇ℓ(θ, (x, y)) = h′(⟨x, θ⟩ − y)\,x \quad \text{and} \quad ∇^2 ℓ(θ, (x, y)) = h″(⟨x, θ⟩ − y)\,xx^⊤.
\]
A prototypical example is h(t) = log(1 + e^t) + log(1 + e^{−t}), which satisfies h′(t) = \frac{e^t − 1}{e^t + 1} ∈
[−1, 1] and h″(t) = \frac{2e^t}{(e^t + 1)^2} ∈ (0, 1/2]. So long as the covariates x ∈ X have finite radius
rad₂(X) := sup_{x∈X} ∥x∥₂, we obtain the bounds M0 ≤ rad₂(X), M1 ≤ rad₂(X)²/2, and
M2 ≤ rad₂(X)³/2 for Assumption A.5.1, parts (i)–(iii). In general, if h is symmetric with h″(0) > 0, then
minimizers exist whenever Y is non-pathological and E[XX⊤] ≻ 0. Exercise 5.11 asks you to
prove this last claim on existence of minimizers. 3
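A quick numerical check of the claimed derivative ranges (our own sketch):

```python
# Numerical check (ours) that h(t) = log(1+e^t) + log(1+e^{-t}) has
# h' in [-1, 1] and h'' in (0, 1/2], with h'' maximized at t = 0.
import numpy as np

t = np.linspace(-20.0, 20.0, 10001)
h1 = (np.exp(t) - 1) / (np.exp(t) + 1)       # h'(t)
h2 = 2 * np.exp(t) / (np.exp(t) + 1) ** 2    # h''(t)
print(h1.min(), h1.max())    # approaches -1 and 1
print(h2.min(), h2.max())    # positive, at most 1/2 (attained at t = 0)
```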
Lemma 5.3.4. Let h be convex, θ0 ∈ dom h, and v an arbitrary vector. Then for all t ≥ 1,
\[
  h(θ_0 + tv) − h(θ_0) \ge t\big(h(θ_0 + v) − h(θ_0)\big).
\]
Now we connect these results to the growth of suitably smooth convex functions. Here, we wish
to argue that the minimizer of a convex function h is not too far from a benchmark point θ0 at
which h has strong upward curvature and small gradient.
JCD Comment: Figure.
Lemma 5.3.6. Let h be convex and λ > 0, γ ≥ 0, and ϵ ≥ 2γ/λ. Assume that for some θ0, we
have ∥∇h(θ0)∥2 ≤ γ and ∇²h(θ) ⪰ λI for all θ satisfying ∥θ − θ0∥2 ≤ ϵ. Then the minimizer
θ̂ = argminθ h(θ) exists and satisfies
\[
  \|\hat{θ} − θ_0\|_2 \le \frac{2γ}{λ}.
\]
Proof  By Taylor's theorem, for any θ we have
\[
  h(θ) = h(θ_0) + ⟨∇h(θ_0), θ − θ_0⟩ + \frac{1}{2}(θ − θ_0)^⊤∇^2 h(\tilde{θ})(θ − θ_0)
\]
for a point θ̃ on the line between θ and θ0. Now, let us take θ such that ∥θ − θ0∥2 ≤ ϵ. Then by
assumption ∇²h(θ̃) ⪰ λI, and so we have
\[
  h(θ) \ge h(θ_0) + ⟨∇h(θ_0), θ − θ_0⟩ + \frac{λ}{2}\|θ − θ_0\|_2^2
       \ge h(θ_0) − γ\|θ − θ_0\|_2 + \frac{λ}{2}\|θ − θ_0\|_2^2,
\]
by the assumption that ∥∇h(θ0)∥2 ≤ γ and the Cauchy–Schwarz inequality.

Fix t ≥ 0. If we can show that h(θ) > h(θ0) for all θ satisfying ∥θ − θ0∥2 = t, then Lemma 5.3.5
implies that h(θ) > h(θ0) whenever ∥θ − θ0∥2 ≥ t, so that necessarily ∥θ̂ − θ0∥2 < t. Returning to
the previous display and letting t = ∥θ − θ0∥2, note that
\[
  h(θ) \ge h(θ_0) − γt + \frac{λ}{2}t^2,
\]
Lemma 5.3.7. Let h be convex and assume that ∇²h is M2-Lipschitz (part (iii) of Assump-
tion A.5.1). Let λ > 0 be large enough and γ > 0 be small enough that γ < λ²/(8M2). Then if
both ∇²h(θ0) ⪰ λI and ∥∇h(θ0)∥2 ≤ γ, the minimizer θ̂ = argminθ h(θ) exists and satisfies
\[
  \|\hat{θ} − θ_0\|_2 \le \frac{4γ}{λ}.
\]
Proof  By Lemma 5.3.6, it is enough to show that ∇²h(θ) ⪰ (λ/2)I for all θ with ∥θ − θ0∥2 ≤ 4γ/λ. For
this, we use the M2-Lipschitz continuity of ∇²h to obtain that for any θ with ∥θ − θ0∥2 = t,
\[
  ∇^2 h(θ) ⪰ ∇^2 h(θ_0) − M_2 t\,I ⪰ (λ − M_2 t)I ⪰ \frac{λ}{2}I,
\]
the final inequality holding because t ≤ 4γ/λ < λ/(2M2) under the assumption γ < λ²/(8M2).

To apply these growth lemmas to the empirical loss Ln, we must argue that ∇Ln(θ⋆) is small and that
\[
  ∇^2 L_n(θ^⋆) ⪰ λI \tag{5.3.2b}
\]
with high probability. Happily, the convergence guarantees we develop in Chapter 4 provide pre-
cisely the tools to do this.
Theorem 5.3.8. Let Assumption A.5.1 hold for the M-estimation problem (5.3.1). Let δ ∈ (0, 1/2),
and define
\[
  γ_n(δ) := \frac{M_0}{\sqrt{n}}\Big(1 + \sqrt{2\log\frac{1}{δ}}\Big)
  \quad \text{and} \quad
  ϵ_n(δ) := \max\Big\{ 2\|∇^2 L(θ^⋆)\|_{\mathrm{op}}\sqrt{\frac{1}{n}\log\frac{2d}{δ}},\; \frac{4M_1}{3n}\log\frac{2d}{δ} \Big\}.
\]
Then we have both ∥∇Ln(θ⋆)∥2 ≤ γn(δ) and ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≤ ϵn(δ) with probability at
least 1 − 2δ. So long as ϵn(δ) ≤ λ/4 and γn(δ) ≤ 9λ²/(128M2), then with the same probability,
\[
  \|\hat{θ}_n − θ^⋆\|_2 \le \frac{16}{3}\,\frac{γ_n(δ)}{λ}.
\]
We defer the proof of the theorem to Section 5.3.5, instead providing commentary and a few
examples of its application. Ignoring the numerical constants, the theorem roughly states the
following: once n is large enough that
\[
  n ≳ \frac{M_2^2 M_0^2}{λ^4}\log\frac{1}{δ}
  \quad \text{and} \quad
  n ≳ \frac{\|∇^2 L(θ^⋆)\|_{\mathrm{op}}^2}{λ^2}\log\frac{d}{δ}, \tag{5.3.3}
\]
with probability at least 1 − δ we have
\[
  \|\hat{θ}_n − θ^⋆\|_2 ≲ \frac{M_0}{λ\sqrt{n}}\sqrt{\log\frac{1}{δ}}. \tag{5.3.4}
\]
These finite sample results are, at least for large n, order optimal, as we will develop in the coming
sections on fundamental limits. Nonetheless, the conditions (5.3.3) are stronger than necessary,
typically requiring that n be quite large. In the exercises, we explore a class of quasi-self-concordant
losses, where the second derivative controls the third derivative, allowing more direct application
of Lemma 5.3.6, which allows reducing this sample size requirement. (See Exercises 5.5 and 5.6).
Example 5.3.9 (Logistic regression, Example 5.3.2 continued): Recalling the logistic loss
ℓ(θ, (x, y)) = log(1 + e^{−y⟨x,θ⟩}) for y ∈ {±1} and x ∈ R^d, assume the domain X consists of
vectors x with ∥x∥2 ≤ √d; for example, if X ⊂ [−1, 1]^d, this holds. In this case, M0 = √d,
while M1 ≤ d/4 and M2 ≲ d^{3/2}. Assuming that the population Hessian ∇²L(θ⋆) = E[p_{θ⋆}(Y |
X)(1 − p_{θ⋆}(Y | X))XX⊤] has minimal eigenvalue λmin(∇²L(θ⋆)) ≳ 1, the conclusions of
Theorem 5.3.8 apply as soon as n ≳ d⁴ log(1/δ). 3
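A simulation sketch (ours; we use scipy's BFGS as a stand-in for any empirical-risk minimizer) exhibits the 1/√n decay of the error that Theorem 5.3.8 predicts:

```python
# Simulation sketch (ours) of Theorem 5.3.8 for logistic regression:
# fit theta_hat by minimizing the empirical logistic loss and watch
# ||theta_hat - theta_star||_2 decay roughly like 1/sqrt(n).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
d = 5
theta_star = rng.standard_normal(d) / np.sqrt(d)

def sample(n):
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    p = 1.0 / (1.0 + np.exp(-X @ theta_star))
    Y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)
    return X, Y

def empirical_loss(theta, X, Y):
    return np.mean(np.logaddexp(0.0, -Y * (X @ theta)))  # log(1 + e^{-y<x,theta>})

for n in [500, 5000, 50000]:
    X, Y = sample(n)
    res = minimize(empirical_loss, np.zeros(d), args=(X, Y), method="BFGS")
    print(f"n={n:6d}  error = {np.linalg.norm(res.x - theta_star):.4f}")
```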
When n is large enough, the guarantee (5.3.4) allows us to also make the heuristic asymptotic
expansions for the exponential family models in Section 3.2.1 hold in finite samples. Let the
conclusions of Theorem 5.3.8 hold, so that ∥∇²Ln(θ⋆) − ∇²L(θ⋆)∥op ≤ ϵn(δ) and so on. Then once
we know that θbn exists, by a Taylor expansion we can write
0 = ∇Ln (θbn ) = ∇Ln (θ⋆ ) + (∇2 Ln (θ⋆ ) + En )(θbn − θ⋆ )
= ∇Ln (θ⋆ ) + (∇2 L(θ⋆ ) + E ′ )(θbn − θ⋆ ),
n
where En is an error matrix satisfying ∥En ∥op ≤ M2 ∥θbn − θ⋆ ∥2 and En′ = En + ∇2 Ln (θ⋆ ) − ∇2 L(θ⋆ )
satisfies ∥En′ ∥op ≤ ∥En ∥op + ϵn (δ) by the triangle inequality and Theorem 5.3.8. Using the infinite
series expansion of the inverse
\[
  (A + E)^{−1} = A^{−1} + \sum_{i=1}^{∞} (−1)^i (A^{−1}E)^i A^{−1},
\]
valid for A ≻ 0 whenever ∥E∥op < λmin (A) (see Exercise 5.4), we therefore have
\[
  \hat{θ}_n − θ^⋆ = −(∇^2 L(θ^⋆) + E_n')^{−1}∇L_n(θ^⋆) = −∇^2 L(θ^⋆)^{−1}∇L_n(θ^⋆) + R_n,
\]
where the remainder vector Rn satisfies
\[
  \|R_n\|_2 ≲ \big\|∇^2 L(θ^⋆)^{−1}E_n'∇^2 L(θ^⋆)^{−1}\big\|_{\mathrm{op}}\|∇L_n(θ^⋆)\|_2
  ≲ \frac{1}{λ^2}\Big(\frac{M_2 M_0}{λ}\sqrt{\frac{\log\frac{1}{δ}}{n}} + ϵ_n(δ)\Big)\|∇L_n(θ^⋆)\|_2
  ≲ \frac{M_2 M_0^2}{λ^3}\,\frac{\log\frac{1}{δ}}{n} + \frac{\|∇^2 L(θ^⋆)\|_{\mathrm{op}}}{λ^2}\,\frac{M_0\sqrt{\log\frac{1}{δ}\log\frac{2d}{δ}}}{n}.
\]
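The inverse-series expansion above is simple to verify numerically; a small sketch (ours) truncates the series and compares it to the exact inverse:

```python
# Numerical check (ours) of the series
# (A + E)^{-1} = A^{-1} + sum_{i>=1} (-1)^i (A^{-1} E)^i A^{-1},
# valid when ||E||_op < lambda_min(A).
import numpy as np

rng = np.random.default_rng(3)
d = 4
B = rng.standard_normal((d, d))
A = B @ B.T + 2 * np.eye(d)                 # A positive definite
E = 0.1 * rng.standard_normal((d, d))       # small perturbation

Ainv = np.linalg.inv(A)
term, approx = Ainv, Ainv.copy()
for i in range(1, 25):
    term = -Ainv @ E @ term                 # (-1)^i (A^{-1}E)^i A^{-1}
    approx += term
print(np.linalg.norm(approx - np.linalg.inv(A + E), ord=2))  # ~ 0
```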
Corollary 5.3.10. Let the conditions of Theorem 5.3.8 hold. Then there exists a problem-dependent
constant C such that the following holds: for any δ > 0, for any n ≥ C log 1δ , with probability at
least 1 − δ
\[
  \hat{θ}_n − θ^⋆ = −∇^2 L(θ^⋆)^{−1}∇L_n(θ^⋆) + R_n,
\]
where the remainder Rn satisfies ∥Rn∥2 ≤ C(1/n)log(d/δ). The constant C may be taken to be continuous
in all the problem parameters of Assumption A.5.1.
Corollary 5.3.10 highlights the two salient terms governing error in estimation problems: the
curvature of the loss near the optimum, as ∇2 L(θ⋆ ) contributes, and the variance in the gradients
∇Ln(θ⋆) = (1/n)\sum_{i=1}^n ∇ℓ(θ⋆, Zi). When the Hessian term ∇²L(θ⋆) is “large,” meaning that ∇²L(θ⋆) ⪰
λI for some large value λ > 0, then estimation is easier: the curvature of the loss helps to identify
θ⋆ . Conversely, when the variance Var(∇ℓ(θ⋆ , Z)) = E[∥∇ℓ(θ⋆ , Z)∥22 ] is large, then estimation is
more challenging. As a final remark, let us imagine that the remainder term Rn in the corollary,
in addition to being small with high probability, satisfies E[∥Rn∥₂²] ≤ C/n², where C is a problem-
dependent constant. Let Gn = −∇²L(θ⋆)⁻¹∇Ln(θ⋆) be the leading term in the expansion, which
satisfies
\[
  E[\|G_n\|_2^2] = \frac{1}{n}E\big[\|∇^2 L(θ^⋆)^{−1}∇ℓ(θ^⋆, Z)\|_2^2\big]
  = \frac{\mathrm{tr}\big(∇^2 L(θ^⋆)^{−1}\mathrm{Cov}(∇ℓ(θ^⋆, Z))∇^2 L(θ^⋆)^{−1}\big)}{n}.
\]
Then because ∥Gn + Rn∥₂² ≤ ∥Gn∥₂² + ∥Rn∥₂² + 2∥Gn∥₂∥Rn∥₂, we have the heuristic
\[
  E\Big[\|\hat{θ}_n − θ^⋆\|_2^2\Big] = E\big[\|G_n + R_n\|_2^2\big]
  = E[\|G_n\|_2^2] + E[\|R_n\|_2^2] ± 2E[\|G_n\|_2\|R_n\|_2]
  \overset{(⋆)}{=} \frac{\mathrm{tr}\big(∇^2 L(θ^⋆)^{−1}\mathrm{Cov}(∇ℓ(θ^⋆, Z))∇^2 L(θ^⋆)^{−1}\big)}{n} ± \frac{C}{n^{3/2}}, \tag{5.3.5}
\]
where C is a problem-dependent constant and the step (⋆) is heuristic. See Exercise 5.7 for one
approach to make this step rigorous.
Corollary 5.3.10 presents one approach to make this rigorous. Assuming the sufficient statistics ϕ
are bounded, we have ∇²L(θ⋆) = ∇²A(θ⋆) and Covθ⋆(∇Ln(θ⋆)) = (1/n)Covθ⋆(ϕ(X)) = (1/n)∇²A(θ⋆),
so
\[
  \hat{θ}_n − θ^⋆ = −∇^2 A(θ^⋆)^{−1}\,\frac{1}{n}\sum_{i=1}^n\big(ϕ(X_i) − ∇A(θ^⋆)\big) + R_n,
\]
where Rn is as in Corollary 5.3.10. One way to obtain bounds in expectation is to consider the
estimator θ̂n only on some “good” event En that occurs with high probability, for example, that the
remainder Rn is small. The next corollary provides a prototypical result under the assumption that
Xi ∼iid Pθ⋆ for an exponential family model with bounded data supx∈X ∥ϕ(x)∥2 < ∞ and positive
definite Hessian ∇²A(θ⋆) ≻ 0.
Corollary 5.3.11. Under the preceding conditions on the exponential family model Pθ⋆, there exists
a problem-dependent constant C < ∞ such that the following holds: for any k ≥ 1, there are events
En with probability P(En) ≥ 1 − n^{−k} and
\[
  E_{θ^⋆}\Big[\|\hat{θ}_n − θ^⋆\|_2^2 \cdot 1\{E_n\}\Big] \le \frac{1}{n}\mathrm{tr}\big(∇^2 A(θ^⋆)^{−1}\big) + \frac{Ck\log n}{n^{3/2}}.
\]
The constant C may be taken continuous in θ⋆ .
Recalling the equality (3.3.2), we see that the Fisher information ∇2 A(θ) appears in a fundamental
way for the exponential families. Proposition 3.5.1 shows that this quantity is fundamental, at
least for testing; here it provides an upper bound on the convergence of the maximum likelihood
estimator. Exercise 5.8 extends Corollary 5.3.11 to an equality to within lower order terms.
Proof  Let δ = δn > 0 to be chosen and define the event En to be that ∥Rn∥2 ≤ C(1/n)log(d/δ), which
occurs with probability at least 1 − δ. Let
\[
  G_n = −∇^2 L(θ^⋆)^{−1}∇L_n(θ^⋆) = −∇^2 A(θ^⋆)^{−1}P_n(ϕ(X) − ∇A(θ^⋆))
\]
be the mean-zero gradient term. Then θ̂n − θ⋆ = Gn + Rn, and ∥Gn + Rn∥₂² ≤ ∥Gn∥₂² + ∥Rn∥₂² +
2∥Gn∥₂∥Rn∥₂. On the event En we have ∥Rn∥₂ ≤ (C/n)log(d/δ), and so
\[
  E\Big[\|\hat{θ}_n − θ^⋆\|_2^2\,1\{E_n\}\Big]
  \le E[\|G_n\|_2^2] + \frac{2C}{n}\log\frac{d}{δ}\,E[\|G_n\|_2] + \frac{C^2}{n^2}\log^2\frac{d}{δ}.
\]
Now note that E[∥Gn∥₂²] = (1/n)tr(∇²A(θ⋆)⁻¹), so that E[∥Gn∥₂] ≤ \sqrt{\mathrm{tr}(∇^2 A(θ^⋆)^{−1})/n}, and
set δ = n^{−k} to obtain the claimed bound.
These ideas also extend to generalized linear models, such as linear, logistic, or Poisson regression
(recall Chapter 3.4). For the abstract generalized linear model of predicting a target y ∈ Y from
covariates x ∈ X, we have the densities pθ(y | x) = exp(ϕ(x, y)⊤θ − A(θ | x)) and the loss
\[
  ℓ(θ, (x, y)) = −ϕ(x, y)^⊤θ + A(θ \mid x).
\]
Because the log partition is C ∞ , the smoothness conditions in Assumption A.5.1 then reduce to the
boundedness
rad2 ({ϕ(x, y) | x ∈ X , y ∈ Y}) := sup ∥ϕ(x, y)∥2 < ∞.
x∈X ,y∈Y
Assuming that a minimizer θ⋆ = argminθ L(θ) exists, the (local) strong convexity condition that
∇2 L(θ⋆ ) ≻ 0 then becomes that E[∇2 A(θ | X)] = E[Covθ (ϕ(X, Y ) | X)] ≻ 0. Exercise 5.10, part (c)
gives general sufficient conditions for the existence of minimizers in GLMs.
For logistic regression (Example 3.4.2), these conditions correspond to a bound on the covariate
data x, that E[XX⊤] ≻ 0, and that for each X, the label Y is non-deterministic. For Poisson
regression (Example 3.4.4), we have ℓ(θ, (x, y)) = −yx⊤θ + e^{θ⊤x}. When the count data Y ∈ N can
be unbounded, Assumption A.5.1.(i) may fail, because yx may be unbounded. If we model a data-
generating process for which X and Y are both bounded using Poisson regression, however, then
the smoothness conditions in Assumption A.5.1 hold. (Again, see Exercise 5.10 for the existence
of solutions.)
Regardless, by an argument completely parallel to that for Corollary 5.3.10, we can provide
convergence rates for generalized linear model estimators. Here, we avoid the assumption of model
fidelity, instead assuming that θ⋆ = argminθ L(θ) exists and ∇2 L(θ⋆ ) = E[∇2 A(θ⋆ | X)] ≻ 0, so
that θ⋆ is unique.
Corollary 5.3.12. Let the preceding conditions hold and pθ be a generalized linear model. Then
there exists a problem constant C < ∞ such that the following holds: for any δ ∈ (0, 1) and for all
n ≥ C log(1/δ), with probability at least 1 − δ,
\[
  \hat{θ}_n − θ^⋆ = −∇^2 L(θ^⋆)^{−1}∇L_n(θ^⋆) + R_n, \quad \text{where } \|R_n\|_2 \le \frac{C}{n}\log\frac{d}{δ}.
\]
When the generalized linear model Pθ is correct, so that X ∼ P marginally and Y | X ∼ Pθ (· | X),
then Cov(ϕ(X, Y ) | X) = ∇2 A(θ⋆ | X), and so in this case (again, for a sequence of events En with
probability at least 1 − 1/nk ), we have
\[
  E_{θ^⋆}\Big[\|\hat{θ}_n − θ^⋆\|_2^2\,1\{E_n\}\Big] \le \frac{\mathrm{tr}\big(E[∇^2 A(θ^⋆ \mid X)]^{−1}\big)}{n} + \frac{Ck\log n}{n^{3/2}}.
\]
Note that this quantity is the trace of the inverse Fisher information (3.3.2) in the generalized
linear model: the “larger” the information, the better estimation accuracy we can guarantee.
Lemma 5.3.13. Let Assumption A.5.1 hold. Then for any δ ∈ (0, 1), with probability at least 1 − δ,
\[
  \|∇L_n(θ^⋆)\|_2 \le \frac{M_0}{\sqrt{n}}\Big(1 + \sqrt{2\log\frac{1}{δ}}\Big).
\]
Proof  The function z_1^n ↦ ∥∇Ln(θ⋆)∥2 satisfies bounded differences: for any two empirical samples
Pn, Pn′ differing in only observation i,
Lemma 5.3.14. Let Assumption A.5.1 hold. Then for any δ ∈ (0, 1), with probability at least 1 − δ,
\[
  \|∇^2 L_n(θ^⋆) − ∇^2 L(θ^⋆)\|_{\mathrm{op}}
  \le \max\Big\{ 2\|∇^2 L(θ^⋆)\|_{\mathrm{op}}\sqrt{\frac{1}{n}\log\frac{2d}{δ}},\; \frac{4M_1}{3n}\log\frac{2d}{δ} \Big\}.
\]
Proof  Because ∇²Ln(θ) = (1/n)\sum_{i=1}^n ∇²ℓ(θ, Zi) and ∥∇²ℓ(θ, z)∥op ≤ M1 by Assumption A.5.1(ii),
Theorem 4.3.6 implies that
\[
  P\Big(\|∇^2 L_n(θ^⋆) − ∇^2 L(θ^⋆)\|_{\mathrm{op}} \ge t\Big)
  \le 2d\exp\Big(−n\min\Big\{\frac{t^2}{4\|∇^2 L(θ^⋆)\|_{\mathrm{op}}^2},\; \frac{3t}{4M_1}\Big\}\Big).
\]
Let ϵn(δ) be the bound on the right side of Lemma 5.3.14. Then with probability at least 1 − δ,
\[
  ∇^2 L_n(θ^⋆) ⪰ ∇^2 L(θ^⋆) − ϵ_n(δ)I_d ⪰ \frac{3λ}{4}I_d
\]
by the assumption that n is large enough that ϵn(δ) ≤ λ/4 and that ∇²L(θ⋆) ⪰ λI; we therefore have
the first condition of Lemma 5.3.7 with 3λ/4 replacing λ. Lemma 5.3.13 gives that with probability at
least 1 − δ, ∥∇Ln(θ⋆)∥2 ≤ γn(δ). Now use the assumption that γn(δ) ≤ 9λ²/(128M2) = (3λ/4)²/(8M2),
so that Lemma 5.3.7 implies
\[
  \|\hat{θ}_n − θ^⋆\|_2 \le \frac{16}{3}\,\frac{γ_n(δ)}{λ},
\]
which proves the theorem.
5.4 Exercises
JCD Comment: Exercise ideas around this: We could try to do things with moment
bounds. Like we’d use something like the Marcinkiewicz bounds, and then some moment
bounds on matrices from different exercises, and we could say something.
Probably also write some moment bound guarantees for matrices with operator norms
would be really neat.
Also, exercises that handle dimension scaling could be fun, along with (associated) con-
vergence rates.
Exercise 5.1: In this question, we show how to use Bernstein-type (sub-exponential) inequal-
ities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequal-
ity (4.1.6)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and
Var(Xi ) ≤ σ 2 , then
\[
  \max\Big\{ P\Big(\frac{1}{n}\sum_{i=1}^n X_i \ge E[X] + t\Big),\; P\Big(\frac{1}{n}\sum_{i=1}^n X_i \le E[X] − t\Big) \Big\}
  \le \exp\Big(−\frac{1}{2}\min\Big\{\frac{5nt^2}{6σ^2},\; \frac{nt}{2b}\Big\}\Big).
\]
We consider minimization of loss functions ℓ over finite function classes F with ℓ ∈ [0, 1], so that if
L(f) = E[ℓ(f, Z)] then |ℓ(f, Z) − L(f)| ≤ 1. Throughout this question, we let L⋆ = min_{f∈F} L(f)
and f⋆ ∈ argmin_{f∈F} L(f), and let f̂n denote any empirical risk minimizer.
We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.
(a) Show that for any fixed f ∈ F,
\[
  P\big(\hat{L}(f) \ge L(f) + t\big) ∨ P\big(\hat{L}(f) \le L(f) − t\big)
  \le \exp\Big(−\frac{1}{2}\min\Big\{\frac{5nt^2}{6L(f)(1 − L(f))},\; \frac{nt}{2}\Big\}\Big).
\]
(b) Define the set of “bad” prediction functions F^{bad}_ϵ := {f ∈ F : L(f) ≥ L⋆ + ϵ}. Show that for
any fixed ϵ ≥ 0 and any f ∈ F^{bad}_{2ϵ}, we have
\[
  P\big(\hat{L}(f) \le L^⋆ + ϵ\big)
  \le \exp\Big(−\frac{1}{2}\min\Big\{\frac{5nϵ^2}{6(L^⋆(1 − L^⋆) + ϵ(1 − ϵ))},\; \frac{nϵ}{2}\Big\}\Big).
\]
(c) Show that the empirical risk minimizer f̂n satisfies
\[
  P\big(L(\hat{f}_n) \ge L(f^⋆) + 2ϵ\big)
  \le \mathrm{card}(F)\cdot\exp\Big(−\frac{1}{2}\min\Big\{\frac{5nϵ^2}{6(L^⋆(1 − L^⋆) + ϵ(1 − ϵ))},\; \frac{nϵ}{2}\Big\}\Big).
\]
(d) Using the result of part (c), argue that with probability at least 1 − δ,
\[
  L(\hat{f}_n) \le L(f^⋆) + \frac{4\log\frac{|F|}{δ}}{n}
  + \sqrt{\frac{12}{5}}\cdot\frac{\sqrt{L^⋆(1 − L^⋆)\log\frac{|F|}{δ}}}{\sqrt{n}}.
\]
Why is this better than an inequality based purely on the boundedness of the loss ℓ, such as
Theorem 5.2.4 or Corollary 5.2.6? What happens when there is a perfect risk minimizer f ⋆ ?
\[
  \big\|(A + E)^{−1} − S_k\big\|_{\mathrm{op}} \le \frac{γ^{k+1}}{λ_{\min}(A)(1 − γ)}.
\]
(a) Define g(t) = log f″(t). Show that if f is C-q.s.c., then |g′(t)| ≤ C, and so for any s, t ∈ R,
\[
  f″(t)e^{−C|s−t|} \le f″(s) \le f″(t)e^{C|s−t|}.
\]
(b) Show that the function f (t) = log(1 + et ) + log(1 + e−t ) is q.s.c., and give its self-concordance
parameter.
(c) Show that the function f (t) = log(1 + et ) is q.s.c., and give its self-concordance parameter.
(d) Let f be C-q.s.c. and for a fixed x ∈ R^d, define h(θ) := f(⟨θ, x⟩). Show that for any θ, θ0 with
∆ = θ − θ0, h satisfies
\[
  ∇^2 h(θ) ⪰ e^{−C|⟨∆, x⟩|}\,∇^2 h(θ_0) ⪰ e^{−C\|∆\|_2\|x\|_2}\,∇^2 h(θ_0).
\]
(a) Show that logistic regression with loss ℓ(θ, (x, y)) = log(1+e−y⟨x,θ⟩ ) and robust linear regression
with loss ℓ(θ, (x, y)) = log(1 + ey−⟨x,θ⟩ ) + log(1 + e⟨x,θ⟩−y ) are both 1-q.s.c. losses.
For the remainder of the problem, assume that the data X ⊂ R^d satisfy ∥x∥2 ≤ √d for all x ∈ X.
Let L(θ) = E[ℓ(θ, (X, Y ))] and θ⋆ = argminθ L(θ). Assume that ∇2 L(θ⋆ ) ⪰ λI, where λ > 0 is
fixed, and let Ln (θ) = Pn ℓ(θ, (X, Y )) as usual.
(b) Show that if ∥θ − θ⋆∥2 ≤ 1/√d, then ∇²Ln(θ) ⪰ e^{−C}∇²Ln(θ⋆).
(d) Let ℓ be a 1-q.s.c. loss and assume that |h′ (t, y)| ≤ 1 and |h′′ (t, y)| ≤ 1 for all t ∈ R. Give a
result similar to that of Theorem 5.3.8, but show that your conclusions hold with probability
at least 1 − δ as soon as
\[
  n ≳ \frac{d^2}{λ^2}\log\frac{d}{δ}.
\]
Exercise 5.7 (Truncation to obtain a moment bound): Let B < ∞. Show that under the
conditions of Corollary 5.3.10,
\[
  E\Big[\|\hat{θ}_n − θ^⋆\|_2^2 ∧ B\Big]
  \le \frac{\mathrm{tr}\big(∇^2 L(θ^⋆)^{−1}\mathrm{Cov}(∇ℓ(θ^⋆, Z))∇^2 L(θ^⋆)^{−1}\big)}{n} + \frac{C\log n}{n^{3/2}},
\]
where C is a problem-dependent constant.
Exercise 5.8: Let θ̂n = argminθ Ln(θ) for an M-estimation problem satisfying the conditions of
Corollary 5.3.10. Show that for any k ≥ 1, there are events En with P(En) ≥ 1 − n^{−k} and for which
\[
  \Big| E\Big[\|\hat{θ}_n − θ^⋆\|_2^2\,1\{E_n\}\Big]
  − \frac{1}{n}\mathrm{tr}\big(∇^2 L(θ^⋆)^{−1}\mathrm{Cov}(∇ℓ(θ^⋆, Z))∇^2 L(θ^⋆)^{−1}\big) \Big|
  \le \frac{Ck\log n}{n^{3/2}}.
\]

Exercise 5.9:
(a) Let Pθ be a minimal exponential family (Definition 3.2) with density pθ (x) = exp(θ⊤ x − A(θ))
with respect to a base measure µ. Show that for any θ⋆ ∈ dom A, the well-specified population
loss L(θ) = −Eθ⋆ [log pθ (X)] has unique minimizer θ⋆ .
For the remainder of the problem, we no longer assume the model is well-specified. For a measure
µ on a set X, recall that the essential supremum of a function f : X → R is ess sup_µ f := inf{t ∈ R |
µ({x : f(x) > t}) = 0}, and we say that µ essentially covers a point u ∈ R^d if
\[
  \mathrm{ess\,sup}_{µ}\,⟨v, x⟩ > ⟨v, u⟩ \quad \text{for each } v ∈ S^{d−1}. \tag{5.4.1}
\]
(b) Let {Pθ } be the exponential family with density pθ (x) = exp(θ⊤ x − A(θ)) with respect to the
measure µ. Let X be a random variable for which µ essentially covers (5.4.1) the mean E[X].
Show that L(θ) = −E[log pθ (X)] has a minimizer.
Hint. A continuous convex function h has a minimizer if it is coercive, meaning that h(θ) → ∞
whenever ∥θ∥2 → ∞. Corollary C.3.7, part (i) may be useful.
Exercise 5.10: In this problem, you provide sufficient conditions for generalized linear models to
have minimizers. Let L(θ) = E[ℓ(θ, (X, Y ))] be the population loss (which may be misspecified).
(a) Consider Poisson regression with loss ℓ(θ, (x, y)) = −yx⊤θ + e^{θ⊤x}. Show that L has a unique
minimizer if E[X] = 0 and Cov(X) = E[XX⊤] ≻ 0.

(b) Consider logistic regression with loss ℓ(θ, (x, y)) = −yx⊤θ + log(1 + e^{θ⊤x}) for y ∈ {0, 1}. Show
that L has a unique minimizer if E[XX⊤] ≻ 0 and 0 < P(Y = 1 | X) < 1 with probability 1
over X.
(c) Consider a generalized linear model with densities pθ (y | x) = exp(ϕ(x, y)⊤ θ − A(θ | x)) w.r.t.
a base measure µ(· | x) on y ∈ Y, and assume for simplicity that µ(Y | x) = 1 for all x. Assume
that for each vector v ∈ S^{d−1} and x ∈ X,
\[
  \mathrm{ess\,sup}_{µ(·\mid x)}\{v^⊤ϕ(x, y)\} \ge E_P[v^⊤ϕ(x, Y) \mid X = x],
\]
and the set of x for which a strict inequality holds has positive P -probability. (This is equivalent
to the set of x for which µ(· | x) essentially covers (5.4.1) the conditional mean EP [ϕ(x, Y ) |
X = x] having positive probability.) Show that a minimizer of L exists. You may assume
E[|A(θ | X)|] < ∞ for ∥θ∥2 ≤ 1 if it is convenient.
Hint. The techniques to solve Exercise 5.9 may be useful. In addition, see Exercise 4.2.
Exercise 5.11: Consider the robust regression setting of Example 5.3.3, and let h ≥ 0 be a
symmetric convex function, twice continuously differentiable in a neighborhood of 0. Assume that
for any (measurable) subset X0 ⊂ X , E[Y | X ∈ X0 ] exists and is finite, and assume E[XX ⊤ ] ≻
0. Show that a minimizer of L(θ) := E[h(⟨θ, X⟩ − Y )] exists. Hint. Show that L is coercive.
Corollary C.3.7, part (i) may be useful.
Exercise 5.12 (The delta method for approximate sums): Let T : Rd → Rp be a differentiable
function with derivative matrix Ṫ (θ) ∈ Rp×d , so that T (θ + ∆) = T (θ) + Ṫ (θ)∆ + o(∥∆∥) as ∆ → 0.
Let θbn ∈ Rd be a sequence of random vectors with
θbn − θ = Pn Z + Rn ,
where Zi are i.i.d. and Rn is a remainder term.
√
(a) Assume that E[∥Zi∥₂²] < ∞ and that for each ϵ > 0, P(∥Rn∥2 ≥ ϵ/√n) → 0 as n → ∞. Show
that
\[
  T(\hat{θ}_n) − T(θ) = \dot{T}(θ)P_n Z + R_n',
\]
where the remainder Rn′ also satisfies P(∥Rn′∥2 ≥ ϵ/√n) → 0 for all ϵ > 0.
(b) Assume that T is locally smooth enough that for some K < ∞ and δ > 0,
\[
  \big\|T(θ + ∆) − T(θ) − \dot{T}(θ)∆\big\|_2 \le K\|∆\|_2^2
\]
when ∥∆∥2 ≤ δ. Assume additionally that there exist C0, C1 < ∞ such that for t ≥ 0, we have
∥Rn∥2 ≤ (C0/n)t with probability at least 1 − e^{−t} and that P(∥Pn Z∥2 ≥ t) ≤ C1 exp(−nt²/σ²).
Give a quantitative version of part (a).
JCD Comment: Add in some connections to the exponential family material. Some
ideas:
1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)
2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?
3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.
Chapter 6
Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
X and some collection of functions f : X → R. We then wish to understand when we expect the
empirical distribution
\[
  P_n := \frac{1}{n}\sum_{i=1}^n 1_{X_i},
\]
defined by the sample Xi ∼iid P, to be close to the population P as measured by f. Following the
notation we introduce in Section 5.1, for P f := EP [f (X)], we again ask to have
\[
  P_n f − P f = \frac{1}{n}\sum_{i=1}^n f(X_i) − E_P[f(X)]
\]
be small, uniformly over f in some collection. The starting point is the Donsker–Varadhan variational
representation of the KL-divergence.

Theorem 6.1.1 (Donsker–Varadhan). Let P and Q be probability distributions on X. Then
\[
  D_{kl}(P\|Q) = \sup_g \big\{ E_P[g(X)] − \log E_Q[e^{g(X)}] \big\},
\]
where the supremum is taken over measurable functions g : X → R with EQ [eg(X) ] < ∞. We can
also replace this by bounded simple functions g.
We give one proof of this result and one sketch of a proof, which holds when the underlying space
is discrete, that may be more intuitive: the first constructs a particular “tilting” of Q via the
function eg , and verifies the equality. The second relies on the discretization of the KL-divergence
and may be more intuitive to readers familiar with convex optimization: essentially, we expect this
result because the function log(\sum_{j=1}^k e^{x_j}) is the convex conjugate of the negative entropy. (See also
Exercise 6.1.)
Proof We may assume that P is absolutely continuous with respect to Q, meaning that Q(A) = 0
implies that P (A) = 0, as otherwise both sides are infinite by inspection. Thus, it is no loss of
generality to let P and Q have densities p and q.
Attainment in the equality is easy: we simply take g(x) = log(p(x)/q(x)), so that EQ[e^{g(X)}] = 1. To
show that the right hand side is never larger than Dkl (P ||Q) requires a bit more work. To that
end, let g be any function such that EQ[e^{g(X)}] < ∞, and define the random variable Zg(x) =
e^{g(x)}/EQ[e^{g(X)}], so that EQ[Zg] = 1. Then using the absolute continuity of P w.r.t. Q, we have
\[
  E_P[\log Z_g] = E_P\Big[\log\frac{p(X)}{q(X)} + \log\Big(Z_g(X)\frac{q(X)}{p(X)}\Big)\Big]
  = D_{kl}(P\|Q) + E_P\Big[\log\Big(Z_g\frac{dQ}{dP}\Big)\Big]
  \le D_{kl}(P\|Q) + \log E_P\Big[Z_g\frac{dQ}{dP}\Big]
  = D_{kl}(P\|Q) + \log E_Q[Z_g].
\]
As EQ [Zg ] = 1, using that EP [log Zg ] = EP [g(X)] − log EQ [eg(X) ] gives the result.
For the claim that bounded simple functions are sufficient, all we need to do is demonstrate
(asymptotic) achievability. For this, we use the definition (2.2.1) of the KL-divergence as a
supremum over partitions. Take A_n to be a sequence of partitions so that Dkl(P||Q | A_n) →
Dkl(P||Q). Then let g_n(x) = \sum_{A∈A_n} 1\{x ∈ A\}\log\frac{P(A)}{Q(A)}, which gives Dkl(P||Q | A_n) =
EP[g_n(X)] − log EQ[e^{g_n(X)}].
Here is the second proof of Theorem 6.1.1, which applies when X is discrete and finite. That we
can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests
that this approach works in general—which it can—but this requires some not completely trivial
approximations of EP [g] and EQ [eg ] by discretized versions of their expectations, which makes
things rather tedious.
Proof of Theorem 6.1.1, the finite case   We have assumed that P and Q have finite
supports, which we identify with {1, . . . , k}, with p.m.f.s p, q ∈ ∆_k = {p ∈ R_+^k | ⟨1, p⟩ = 1}. Define
f_q(v) = log(\sum_{j=1}^k q_j e^{v_j}), which is convex in v (recall Proposition 3.2.1). Then the supremum in
the variational representation takes the form
\[
  h(p) := \sup_{v∈R^k}\{⟨p, v⟩ − f_q(v)\}.
\]
If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that
end, note that
\[
  ∇_v\{⟨p, v⟩ − f_q(v)\} = p − \Bigg[\frac{q_i e^{v_i}}{\sum_{j=1}^k q_j e^{v_j}}\Bigg]_{i=1}^k,
\]
so that setting v_j = log(p_j/q_j) achieves p − ∇_v f_q(v) = p − p = 0 and hence the supremum. Noting that
log(\sum_{j=1}^k q_j\exp(\log(p_j/q_j))) = log(\sum_{j=1}^k p_j) = 0 then gives h(p) = ⟨p, v⟩ = \sum_j p_j\log(p_j/q_j) = D_{kl}(p\|q).
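This finite-case formula is easy to check numerically; in the sketch below (ours), the objective at vj = log(pj/qj) matches the KL-divergence, while random choices of v never exceed it:

```python
# Numerical check (ours) of D_kl(p||q) = sup_v {<p,v> - log sum_j q_j e^{v_j}}.
import numpy as np

rng = np.random.default_rng(4)
k = 6
p = rng.dirichlet(np.ones(k))
q = rng.dirichlet(np.ones(k))

kl = float(np.sum(p * np.log(p / q)))
v_opt = np.log(p / q)
attained = float(p @ v_opt - np.log(np.sum(q * np.exp(v_opt))))
print(kl, attained)          # equal (up to floating point)

for _ in range(5):           # no other v does better
    v = rng.standard_normal(k)
    assert p @ v - np.log(np.sum(q * np.exp(v))) <= kl + 1e-12
```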
The Donsker-Varadhan variational representation already gives a hint that we can use some
information-theoretic techniques to control the difference between an empirical sample and its
expectation, at least in an average sense. In particular, we see that for any function g, we have
for any random variable X. Now, changing this on its head a bit, suppose that we consider a
collection of functions F and put two probability measures π and π0 on F, and consider Pn f − P f ,
where we consider f a random variable f ∼ π or f ∼ π0 . Then a consequence of the Donsker-
Varadhan theorem is that
\[
  \int (P_n f − P f)\,dπ(f) \le D_{kl}(π\|π_0) + \log\int \exp(P_n f − P f)\,dπ_0(f)
\]
for any π, π0 . While this inequality is a bit naive—bounding a difference by an exponent seems
wasteful—as we shall see, it has substantial applications when we can upper bound the KL-
divergence Dkl (π||π0 ).
expectation of f under P . The main theorem of this section shows that averages of the squared
error (Pn f − P f )2 of the empirical distribution Pn to P converge quickly to zero for all averaging
distributions π on functions f ∈ F so long as each f is σ 2 -sub-Gaussian, with the caveat that we
pay a cost for different choices of π. The key is that we choose some prior distribution π0 on F
first.
Theorem 6.2.1. Let Π be the collection of all probability distributions on the set F and let π0 be
a fixed prior probability distribution on f ∈ F. With probability at least 1 − δ,
\[
  \int (P_n f − P f)^2\,dπ(f) \le \frac{8σ^2}{3}\,\frac{D_{kl}(π\|π_0) + \log\frac{2}{δ}}{n}
  \quad \text{simultaneously for all } π ∈ Π.
\]
Proof  Without loss of generality, we assume that P f = 0 for all f ∈ F, and recall that Pn f =
(1/n)\sum_{i=1}^n f(X_i) is the empirical mean of f. Then we know that Pn f is σ²/n-sub-Gaussian, and
Lemma 6.2.2 implies that E[exp(λ(Pn f)²)] ≤ (1 − 2λσ²/n)_+^{−1/2} for any f, and thus for any prior
π0 on f we have
\[
  E\int \exp(λ(P_n f)^2)\,dπ_0(f) \le \big(1 − 2λσ^2/n\big)_+^{−1/2}.
\]
Consequently, taking λ = λn := \frac{3n}{8σ^2}, we obtain
\[
  E\int \exp(λ_n(P_n f)^2)\,dπ_0(f) = E\int \exp\Big(\frac{3n}{8σ^2}(P_n f)^2\Big)dπ_0(f) \le 2.
\]
This holds without any probabilistic qualifications, so using the application (6.2.1) of Markov's
inequality with λ = λn, we thus see that with probability at least 1 − δ over X1, . . . , Xn, simulta-
neously for all distributions π,
\[
  \int (P_n f)^2\,dπ(f) \le \frac{8σ^2}{3}\,\frac{D_{kl}(π\|π_0) + \log\frac{2}{δ}}{n},
\]
which gives the theorem.
By Jensen's inequality (or Cauchy–Schwarz), it is immediate from Theorem 6.2.1 that we also
have
\[
  \int |P_n f − P f|\,dπ(f) \le \sqrt{\frac{8σ^2}{3}\,\frac{D_{kl}(π\|π_0) + \log\frac{2}{δ}}{n}}
  \quad \text{simultaneously for all } π ∈ Π \tag{6.2.2}
\]
with probability at least 1 − δ, so that Eπ[|Pn f − P f|] is with high probability of order 1/√n.
inequality (6.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly
sharper constants and improved logarithmic dependence. The key is that stability, in the form of a
prior π0 and posterior π closeness, allow us to achieve reasonably tight control over the deviations
of random variables and functions with high probability.
Let us give an example, which is similar to many of our approaches in Section 5.2, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.
Example 6.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 5.2,
where we let L(θ) = P ℓ(θ, Z) for some function ℓ : Θ × Z → R. Let Bd2 = {v ∈ Rd | ∥v∥2 ≤ 1}
be the ℓ2 -ball in Rd , and let us assume that Θ ⊂ rBd2 and additionally that θ 7→ ℓ(θ, z) is
M -Lipschitz for all z ∈ Z. For simplicity, we assume that ℓ(θ, z) ∈ [0, 2M r] for all θ ∈ Θ (we
may simply relativize our bounds by replacing ℓ by ℓ(·, z) − inf θ∈Θ ℓ(θ, z) ∈ [0, 2M r]).
If L̂n(θ) = Pn ℓ(θ, Z), then Theorem 6.2.1 implies that
\[
  \int |\hat{L}_n(θ) − L(θ)|\,dπ(θ) \le \sqrt{\frac{8M^2r^2}{3n}\Big(D_{kl}(π\|π_0) + \log\frac{2}{δ}\Big)}
\]
for all π with probability at least 1 − δ. Now, let θ0 ∈ Θ be arbitrary, and for ϵ > 0 (to be
chosen later) take π0 to be uniform on (r + ϵ)Bd2 and π to be uniform on θ0 + ϵBd2 . Then we
immediately see that Dkl(π||π0) = d log(1 + r/ϵ). Moreover, we have ∫ L̂n(θ)dπ(θ) ∈ L̂n(θ0) ± Mϵ,
and similarly for L(θ), by the M-Lipschitz continuity of ℓ. For any fixed ϵ > 0, we thus have
\[
  |\hat{L}_n(θ_0) − L(θ_0)| \le 2Mϵ + \sqrt{\frac{8M^2r^2}{3n}\Big(d\log\Big(1 + \frac{r}{ϵ}\Big) + \log\frac{2}{δ}\Big)}
\]
simultaneously for all θ0 ∈ Θ, with probability at least 1 − δ. By choosing ϵ = rd/n we obtain
that with probability at least 1 − δ,
\[
  \sup_{θ∈Θ}|\hat{L}_n(θ) − L(θ)| \le \frac{2Mrd}{n} + \sqrt{\frac{8M^2r^2}{3n}\Big(d\log\Big(1 + \frac{n}{d}\Big) + \log\frac{2}{δ}\Big)}.
\]
Thus, roughly, with high probability we have |L̂n(θ) − L(θ)| \le O(1)Mr\sqrt{\frac{d}{n}\log\frac{n}{d}} for all θ. 3
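To get a feel for the scaling in this example, the following sketch (ours; the parameter values are arbitrary) evaluates the final bound:

```python
# Evaluating the uniform bound of Example 6.2.3 (a sketch; values ours):
#   2Mrd/n + sqrt((8 M^2 r^2 / (3n)) (d log(1 + n/d) + log(2/delta)))
import numpy as np

def lipschitz_uniform_bound(M, r, d, n, delta):
    return 2 * M * r * d / n + np.sqrt(
        8 * M**2 * r**2 / (3 * n) * (d * np.log(1 + n / d) + np.log(2 / delta))
    )

for n in [10**3, 10**4, 10**5]:
    print(n, lipschitz_uniform_bound(M=1.0, r=1.0, d=10, n=n, delta=0.05))
```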
On the one hand, the result in Example 6.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for
specially structured linear function classes—applying Rademacher complexity bounds such as
Proposition 5.2.9 and Example 5.2.10—the PAC-Bayes result is somewhat weaker, in that it depends
on the dimension explicitly, while the Rademacher bounds do not exhibit this explicit dependence.
This means the Rademacher bounds can potentially apply in infinite-dimensional spaces where
Example 6.2.3 cannot.
We will give an example presently showing how to address some of these issues.
For ∆n(f, λ) := λ(Pn f − P f) − λ²σ²(f), the condition (6.2.3) implies that
\[
  E[\exp(n∆_n(f, λ))] = E\big[\exp\big(λ(f(X) − P f) − λ^2σ^2(f)\big)\big]^n \le 1
\]
for all n, f ∈ F, and |λ| ≤ 1/(2b). Then, for any fixed measure π0 on F, Markov's inequality implies
that
\[
  P\Big(\int \exp(n∆_n(f, λ))\,dπ_0(f) \ge \frac{1}{δ}\Big) \le δ. \tag{6.2.4}
\]
Now, as in the proof of Theorem 6.2.1, we use the Donsker-Varadhan Theorem 6.1.1 (change of
measure), which implies that
\[
  n\int ∆_n(f, λ)\,dπ(f) \le D_{kl}(π\|π_0) + \log\int \exp(n∆_n(f, λ))\,dπ_0(f)
\]
for all distributions π. Using inequality (6.2.4), we obtain that with probability at least 1 − δ,
\[
  \int ∆_n(f, λ)\,dπ(f) \le \frac{1}{n}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big)
\]
for all π. As this holds for any fixed |λ| ≤ 1/(2b), this gives the desired result by rearranging.
We would like to optimize the bound in Proposition 6.2.4 by choosing the “best” λ. If we
could choose the optimal λ, by rearranging Proposition 6.2.4 we would obtain the bound
\[
  E_π[P f] \le E_π[P_n f] + \inf_{λ>0}\Big\{ λE_π[σ^2(f)] + \frac{1}{nλ}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big)\Big\}
  = E_π[P_n f] + 2\sqrt{\frac{E_π[σ^2(f)]}{n}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big)}
\]
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 6.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
To that end, let us consider the following condition, a type of relative error condition related
to the Bernstein condition (4.1.7): for each f ∈ F,
σ 2 (f ) ≤ bP f. (6.2.5)
This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = ℓ(θ, X) for some loss function ℓ and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (6.2.5) holds. Revisiting Proposition 6.2.4, we rearrange to obtain the following
theorem.
Theorem 6.2.5. Let F be a collection of functions satisfying the Bernstein condition (6.2.3) as in
Proposition 6.2.4, and in addition, assume the variance-bounding condition (6.2.5). Then for any
0 ≤ λ ≤ 1/(2b), with probability at least 1 − δ,
\[
  E_π[P f] \le E_π[P_n f] + \frac{λb}{1 − λb}E_π[P_n f] + \frac{1}{λ(1 − λb)}\,\frac{1}{n}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big)
\]
for all π.
Proof  Use the condition σ²(f) ≤ bP f in the bound of Proposition 6.2.4, rearrange, and divide both
sides of the resulting inequality by λ(1 − λb).

To make this uniform in λ, thus achieving a tighter bound (so that we need not pre-select λ),
we choose multiple values of λ and apply a union bound. To that end, let 1 + η = 1/(1 − λb), or
η = λb/(1 − λb) and 1/(λ(1 − λb)) = b(1 + η)²/η, so that the inequality in Theorem 6.2.5 is equivalent to
\[
  E_π[P f] \le E_π[P_n f] + ηE_π[P_n f] + \frac{(1 + η)^2}{η}\,\frac{b}{n}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big).
\]
Using that our choice of η ∈ [0, 1] gives (1 + η)²/η ≤ 1/η + 3, this implies
\[
  E_π[P f] \le E_π[P_n f] + ηE_π[P_n f] + \frac{1}{ηn}b\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big) + \frac{3b}{n}\Big(D_{kl}(π\|π_0) + \log\frac{1}{δ}\Big).
\]
Now, take η1 = 1/n, . . . , ηn = 1. Then by optimizing over η ∈ {η1, . . . , ηn} (which is equivalent, to
within a 1/n factor, to optimizing over 0 < η ≤ 1) and applying a union bound, we obtain
Corollary 6.2.6. Let the conditions of Theorem 6.2.5 hold. Then with probability at least 1 − δ,
\[
  E_π[P f] \le E_π[P_n f] + 2\sqrt{\frac{bE_π[P_n f]}{n}\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big)}
  + \frac{1}{n}E_π[P_n f] + \frac{5b}{n}\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big),
\]
simultaneously for all π on F.
Proof  By a union bound, we have
\[
  E_π[P f] \le E_π[P_n f] + ηE_π[P_n f] + \frac{1}{ηn}b\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big) + \frac{3b}{n}\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big)
\]
for each η ∈ {1/n, . . . , 1}. We consider two cases. In the first, assume that Eπ[Pn f] ≤ (b/n)(Dkl(π||π0) +
log(n/δ)). Then taking η = 1 above evidently gives the result. In the second, we have Eπ[Pn f] >
(b/n)(Dkl(π||π0) + log(n/δ)), and we can set
\[
  η_⋆ = \sqrt{\frac{(b/n)\big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\big)}{E_π[P_n f]}} ∈ (0, 1).
\]
Choosing η to be the smallest value ηk in {η1, . . . , ηn} with ηk ≥ η⋆, so that η⋆ ≤ η ≤ η⋆ + 1/n, then
implies the claim in the corollary.
We call the quantity ⟨θ, x⟩y the margin of θ on the pair (x, y), noting that when the margin is
large, ⟨θ, x⟩ has the same sign as y and is “confident” (i.e. far from zero). For shorthand, let us
define the expected and empirical losses at margin γ by
\[
  L_γ(θ) := P(⟨θ, X⟩Y \le γ) \quad \text{and} \quad \hat{L}_γ(θ) := P_n(⟨θ, X⟩Y \le γ).
\]
Consider the following scenario: the data x lie in a ball of radius b, so that ∥x∥2 ≤ b; note that
the losses ℓγ and ℓ0 satisfy the Bernstein (6.2.3) and self-bounding (6.2.5) conditions with constant
1 as they take values in {0, 1}. We then have the following proposition.
Proposition 6.2.7. Let the above conditions on the data (x, y) hold and let the margin γ > 0 and
radius r < ∞. Then with probability at least 1 − δ,
\[
  P(⟨θ, X⟩Y \le 0) \le \Big(1 + \frac{1}{n}\Big)P_n(⟨θ, X⟩Y \le γ)
  + 8\,\frac{rb}{γ}\sqrt{\frac{\log\frac{n}{δ}}{n}}\sqrt{P_n(⟨θ, X⟩Y \le γ)}
  + C\,\frac{r^2b^2\log\frac{n}{δ}}{γ^2 n}
\]
simultaneously for all ∥θ∥2 ≤ r, where C is a numerical constant independent of the problem
parameters.
Proposition 6.2.7 provides a “dimension-free” guarantee—it depends only on the ℓ2 -norms ∥θ∥2
and ∥x∥2 —so that it can apply equally in infinite dimensional spaces. The key to the inequality
is that if we can find a large margin predictor—for example, one achieved by a support vector
machine or, more broadly, by minimizing a convex loss of the form
\[
  \underset{\|θ\|_2 \le r}{\text{minimize}} \quad \frac{1}{n}\sum_{i=1}^n ϕ(⟨X_i, θ⟩Y_i)
\]
for some decreasing convex ϕ : R → R+ , e.g. ϕ(t) = [1 − t]+ or ϕ(t) = log(1 + e−t )—then we get
strong generalization performance guarantees relative to the empirical margin γ. As one particular
instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a
vector θ with ∥θ∥2 ≤ r such that ⟨θ, Xi ⟩Yi ≥ γ for each i = 1, . . . , n. Then Proposition 6.2.7
guarantees that
\[
  P(⟨θ, X⟩Y \le 0) \le C\,\frac{r^2b^2\log\frac{n}{δ}}{γ^2 n}
\]
with probability at least 1 − δ.
Proof  Let π0 be N(0, τ²I) for some τ > 0 to be chosen, and let π be N(θ̂, τ²I) for some θ̂ ∈ R^d
satisfying ∥θ̂∥2 ≤ r, so that Dkl(π||π0) = ∥θ̂∥₂²/(2τ²) ≤ r²/(2τ²). Then Corollary 6.2.6 implies that
\[
  E_π[L_γ(θ)] \le E_π[\hat{L}_γ(θ)] + 2\sqrt{\frac{E_π[\hat{L}_γ(θ)]}{n}\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big)}
  + \frac{1}{n}E_π[\hat{L}_γ(θ)] + \frac{C}{n}\Big(D_{kl}(π\|π_0) + \log\frac{n}{δ}\Big)
\]
\[
  \le E_π[\hat{L}_γ(θ)] + 2\sqrt{\frac{E_π[\hat{L}_γ(θ)]}{n}\Big(\frac{r^2}{2τ^2} + \log\frac{n}{δ}\Big)}
  + \frac{1}{n}E_π[\hat{L}_γ(θ)] + \frac{C}{n}\Big(\frac{r^2}{2τ^2} + \log\frac{n}{δ}\Big).
\]
Let us use the margin assumption. Note that if Z ∼ N(0, τ²I), then for any fixed θ0, x, y we
have
\[
  ℓ_0(θ_0; (x, y)) − P(Z^⊤x \ge γ) \le E[ℓ_γ(θ_0 + Z; (x, y))] \le ℓ_{2γ}(θ_0; (x, y)) + P(Z^⊤x \ge γ),
\]
where the middle expectation is over Z ∼ N(0, τ²I). Using the τ²∥x∥₂²-sub-Gaussianity of Z⊤x, we
can obtain immediately that if ∥x∥2 ≤ b, we have
\[
  ℓ_0(θ_0; (x, y)) − \exp\Big(−\frac{γ^2}{2τ^2b^2}\Big) \le E[ℓ_γ(θ_0 + Z; (x, y))] \le ℓ_{2γ}(θ_0; (x, y)) + \exp\Big(−\frac{γ^2}{2τ^2b^2}\Big).
\]
Returning to our earlier bound, we evidently have that if ∥x∥2 ≤ b for all x ∈ X, then with
probability at least 1 − δ, simultaneously for all θ ∈ R^d with ∥θ∥2 ≤ r,
\[
  L_0(θ) \le \hat{L}_{2γ}(θ) + 2\exp\Big(−\frac{γ^2}{2τ^2b^2}\Big)
  + 2\sqrt{\frac{\hat{L}_{2γ}(θ) + \exp(−\frac{γ^2}{2τ^2b^2})}{n}\Big(\frac{r^2}{2τ^2} + \log\frac{n}{δ}\Big)}
  + \frac{1}{n}\Big(\hat{L}_{2γ}(θ) + \exp\Big(−\frac{γ^2}{2τ^2b^2}\Big)\Big) + \frac{C}{n}\Big(\frac{r^2}{2τ^2} + \log\frac{n}{δ}\Big).
\]
Setting τ² = γ²/(2b² log n), so that exp(−γ²/(2τ²b²)) = 1/n, we immediately see that for any choice
of margin γ > 0, we have with probability at least 1 − δ that
\[
  L_0(θ) \le \hat{L}_{2γ}(θ) + \frac{2}{n}
  + 2\sqrt{\frac{1}{n}\Big(\hat{L}_{2γ}(θ) + \frac{1}{n}\Big)\Big(\frac{r^2b^2\log n}{γ^2} + \log\frac{n}{δ}\Big)}
  + \frac{1}{n}\Big(\hat{L}_{2γ}(θ) + \frac{1}{n}\Big) + \frac{C}{n}\Big(\frac{r^2b^2\log n}{γ^2} + \log\frac{n}{δ}\Big)
\]
for all ∥θ∥2 ≤ r.
Rewriting (replacing 2γ with γ) and recognizing that with no loss of generality we may take γ
such that rb ≥ γ gives the claim of the proposition.
Corollary 6.2.8. Let F be chosen according to any distribution π(· | X_1^n) conditional on the sample
X_1^n. Then with probability at least 1 − δ0 − δ1 over the sample X_1^n ∼iid P,
\[
  E[(P_n F − P F)^2 \mid X_1^n] \le \frac{8σ^2}{3n}\Big(\frac{I(F; X_1^n)}{δ_0} + \log\frac{2}{δ_1}\Big).
\]
This corollary shows that if we have any procedure—say, a learning procedure or otherwise—
that limits the information between a sample X1n and an output F , then we are guaranteed that
F generalizes. Tighter analyses of this are possible, though they are not our focus here; already,
though, there should be an inkling that limiting the information between input samples and outputs
may be fruitful.
The “garden of forking paths,” as Gelman and Loken [100] term it, causes challenges even when researchers are
not “p-hacking” or going on a “fishing expedition” to try to find publishable results. The problem
in these studies and approaches is that, because we make decisions that may, even only in a small
way, depend on the data observed, we have invalidated all classical statistical analyses.
To that end, we now consider interactive data analyses, where we perform data analyses se-
quentially, computing new functions on a fixed sample X1 , . . . , Xn after observing some initial
information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.
If we answer a finite collection Φ of queries exactly, releasing Pn ϕ for each ϕ ∈ Φ, then for
bounded queries ϕ : X → [−1, 1] we have
\[
  \max_{ϕ∈Φ}|P_n ϕ − P ϕ| \le \sqrt{\frac{2\log\frac{2|Φ|}{δ}}{n}}
\]
with probability at least 1 − δ, by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound.
Thus, so long as |Φ| is not exponential in the sample size n, we expect uniformly high accuracy.
Example 6.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-
minimization setting (5.2.2), where the losses ℓ(θ, Xi) are convex and differentiable in θ. Then
gradient descent applied to L̂n(θ) = Pn ℓ(θ, X) will converge to a minimizing value of L̂n. We
can evidently implement gradient descent by a sequence of statistical queries ϕ(x) = ∇θ ℓ(θ, x),
iterating
\[
  θ^{(k+1)} = θ^{(k)} − α_k P_n ϕ^{(k)}, \tag{6.3.2}
\]
where ϕ^{(k)} = ∇θ ℓ(θ^{(k)}, x) and αk is a stepsize. 3
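A minimal sketch (ours) of the iteration (6.3.2) for logistic regression follows, with a mechanism that answers each statistical query by the exact empirical mean Pn ϕ:

```python
# Sketch (ours) of gradient descent implemented via statistical queries,
# iteration (6.3.2), for logistic regression; the mechanism answers each
# query phi with the empirical mean P_n phi.
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.0, -1.0, 0.5])
Y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta_true)), 1.0, -1.0)

def answer_query(phi):
    """Mechanism: return the empirical mean P_n phi."""
    return np.mean([phi(x, y) for x, y in zip(X, Y)], axis=0)

theta = np.zeros(d)
for k in range(200):
    # query phi^{(k)}(x, y): gradient of log(1 + e^{-y<x,theta>}) at current theta
    grad_query = lambda x, y: -y * x / (1 + np.exp(y * (x @ theta)))
    theta = theta - 0.5 * answer_query(grad_query)   # iteration (6.3.2)
print(theta, theta_true)
```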
One issue with Example 6.3.1 is that we are interacting with the dataset, because each
sequential query ϕ(k) depends on the previous k − 1 queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (6.3.2) will be well-behaved regardless of the interactivity.)
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries ϕ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals ϕ by t ∈ T for a
(possibly infinite) set T , so we have a collection {ϕt }t∈T . In this context, we thus have the following
scheme (Figure 6.1):

    Input: Sample X1n drawn i.i.d. P, collection {ϕt}t∈T of possible queries
    Repeat: for k = 1, 2, . . ., the analyst chooses an index Tk and issues the query ϕTk,
    and the mechanism returns an answer Ak approximating EP[ϕTk(X)]

Of interest in the iteration of Figure 6.1 is that we interactively choose T1, T2, . . . , Tk, where the choice Ti
may depend on our approximations of EP [ϕTj (X)] for j < i, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index Tk in alternative ways
depending on the sample X1n , and our goal is to still be able to accurately compute expectations
P ϕT = EP [ϕT (X)] when the index T may depend on X1n . The setting in Figure 6.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.
for all distributions π on T , which may depend on Pn , where the expectation E is taken over the
iid
sample X1n ∼ P . (Here inequality (i) is Theorem 6.1.1, inequality (ii) is Jensen’s inequality, and
inequality (iii) is Lemma 6.2.2.) Now, let π0 be the marginal distribution on T (marginally over
all observations X1n ), and let π denote the posterior of T conditional on the sample X1n . Then
E[Dkl (π||π0 )] = I(X1n ; T ) by definition of the mutual information, giving the bound on the squared
error.
For the second result, note that the Donsker–Varadhan equality implies
\[
  λE\Big[\int P_n ϕ_t\,dπ(t)\Big] \le E[D_{kl}(π\|π_0)] + \log\int E[\exp(λP_n ϕ_t)]\,dπ_0(t)
  \le I(X_1^n; T) + \frac{λ^2σ^2}{2n}.
\]
Dividing both sides by λ and optimizing over λ gives E[P_n ϕ_T] \le \sqrt{2σ^2 I(X_1^n; T)/n}, and performing
the same analysis with −ϕT gives the second result of the theorem.
The key in the theorem is that if the mutual information—the Shannon information—I(X; T )
between the sample X and T is small, then the expected squared error can be small. To make this
a bit clearer, let us choose values for λ in the theorem; taking λ = n/(2eσ²) gives the following
corollary.

Corollary 6.3.3. Under the conditions of the preceding theorem,
\[
  E[(P_n ϕ_T − P ϕ_T)^2] \le \frac{2eσ^2}{n}I(X_1^n; T) + \frac{5σ^2}{4n}.
\]

Consequently, if we can limit the amount of information any particular query T (i.e., ϕT) contains
about the actual sample X1n, then we guarantee reasonably high accuracy in the second-moment errors
(Pn ϕT − P ϕT)².
Example 6.3.4 (A stylized correlation analysis): Consider the following stylized genetics experiment. We observe vectors X ∈ {−1, 1}k , where Xj = 1 if gene j is expressed and −1 otherwise. We also observe phenotypes Y ∈ {−1, 1}, where Y = 1 indicates appearance of the phenotype. In our setting, we assume that the vectors X are uniform on {−1, 1}k and independent of Y , but an experimentalist friend of ours wishes to know if there exists a vector v with ∥v∥2 = 1 such that the correlation between v T X and Y is high, meaning that v T X is associated with Y . In our notation here, we have index set {v ∈ Rk | ∥v∥2 = 1}, and by Example 4.1.6, Hoeffding's lemma, and the independence of the coordinates of X, we have that v T XY is ∥v∥22 /4 = 1/4-sub-Gaussian. Now, we recall the fact that if Zj , j = 1, . . . , k, are σ 2 -sub-Gaussian, then for any p ≥ 1, we have
\[
\mathbb{E}\Big[\max_{j \le k} |Z_j|^p\Big] \le (C p \sigma^2 \log k)^{p/2}
\]
for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarithmically. Indeed, by Theorem 4.1.11, we have for any q ≥ 1 by Hölder's inequality that
\[
\mathbb{E}\Big[\max_j |Z_j|^p\Big]
\le \bigg(\sum_j \mathbb{E}\big[|Z_j|^{pq}\big]\bigg)^{1/q}
\le k^{1/q} (C p q \sigma^2)^{p/2},
\]
and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v1 , . . . , vk , vk+1 , we have
\[
\mathbb{E}\Big[\max_j (v_j^T (P_n Y X))^2\Big] \le O(1) \frac{\log k}{n}.
\]
If instead we allow a single interaction, the problem is different. We issue queries associated with v = e1 , . . . , ek , the k standard basis vectors; then we simply set Vk+1 = Pn Y X/ ∥Pn Y X∥2 . Then evidently
\[
\mathbb{E}\big[(V_{k+1}^T (P_n Y X))^2\big] = \mathbb{E}\big[\|P_n Y X\|_2^2\big] = \frac{k}{n},
\]
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed to interact with the dataset, he or she may be able to discover very large correlations that are certainly false in the population, which in this case has P XY = 0. ◇
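A short simulation makes the gap in Example 6.3.4 concrete. This is a sketch, not part of the example: the values of k, n, and the number of trials below are arbitrary, and the code simply compares the k fixed basis-vector queries to the single adaptive query Vk+1 = Pn Y X/∥Pn Y X∥2 .

import numpy as np

rng = np.random.default_rng(1)
k, n, trials = 1000, 200, 50
fixed_max, adaptive = [], []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, k))   # genes, uniform on {-1,1}^k
    Y = rng.choice([-1.0, 1.0], size=(n, 1))   # phenotype, independent of X
    g = (Y * X).mean(axis=0)                   # the vector P_n YX in R^k
    fixed_max.append(np.max(g ** 2))           # best of the fixed queries e_1..e_k
    adaptive.append(np.sum(g ** 2))            # (V_{k+1}^T P_n YX)^2 = ||P_n YX||_2^2
print("fixed queries: ", np.mean(fixed_max))   # roughly 2 log(k)/n, about 0.07
print("adaptive query:", np.mean(adaptive))    # approximately k/n = 5.0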
Example 6.3.4 shows that, without some care, substantial issues may arise in interactive data analysis scenarios. When we consider our goal more broadly, which is to provide accurate approximations to P ϕ for queries ϕ chosen adaptively, for any population distribution P and any ϕ : X → [−1, 1], it is possible to construct quite perverse situations: if we compute sample expectations Pn ϕ exactly, one round of interaction is sufficient to find a query ϕ for which Pn ϕ − P ϕ ≥ 1.
Example 6.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw a sample X1n of size n on a sample space X = [m] with Xi drawn i.i.d. Uniform([m]), where m ≥ 2n. Let Φ be the collection of all functions ϕ : [m] → [−1, 1], so that P(|Pn ϕ − P ϕ| ≥ t) ≤ exp(−nt2 /2) for any fixed ϕ. Suppose that in the interactive scheme in Fig. 6.1, we simply release answers A = Pn ϕ. One natural attack proceeds as follows: in a first round, query the indicators ϕj (x) = 1 {x = j} for each j ∈ [m], whose exact answers reveal precisely which points appear in the sample; then issue the query ϕ taking the value +1 on the observed sample points and −1 elsewhere. By construction Pn ϕ = 1, while P ϕ ≤ 2n/m − 1 ≤ 0, so that Pn ϕ − P ϕ ≥ 1. ◇
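As a numerical sanity check of this construction (a sketch; the two-round attack below instantiates the indicator queries and the corrupting follow-up query):

import numpy as np

rng = np.random.default_rng(2)
n = 100
m = 2 * n
X = rng.integers(0, m, size=n)        # X_i ~ Uniform([m]), i.i.d.

# Round 1: exact answers to the m indicator queries phi_j(x) = 1{x = j}
# reveal the empirical frequencies, hence exactly which points were sampled.
freqs = np.bincount(X, minlength=m) / n
observed = freqs > 0

# Round 2: the corrupting query, +1 on observed points, -1 elsewhere.
phi = np.where(observed, 1.0, -1.0)
Pn_phi = float(np.mean(phi[X]))       # = 1 by construction
P_phi = float(np.mean(phi))           # = 2 * (#observed)/m - 1 <= 0
print(Pn_phi - P_phi)                 # at least 1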
More generally, when one performs an interactive data analysis (e.g., as in Fig. 6.1), adapting hypotheses while interacting with a dataset, the question is not one of statistical significance or multiplicity control for the single analysis one performs, but for all the possible analyses one might have performed otherwise. Given the branching paths one might take in an analysis, it is clear that we require some care.
With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 6.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X n → A, taking values in our index set A, where
A(X1n ) ∈ A is a random variable that depends on the sample X1n . For simplicity in derivation,
we abuse notation in this section, and for random variables X and Y with distributions P and Q
respectively, we denote
Dkl (X||Y ) := Dkl (P ||Q) .
We then ask for a type of leave-one-out stability for the algorithms A, where A is insensitive to the
changes of a single example (on average).
Definition 6.1. Let ε ≥ 0. A randomized algorithm A : X n → A is ε-KL-stable if for each i ∈ {1, . . . , n} there is a randomized Ai : X n−1 → A such that for every sample xn1 ∈ X n ,
\[
\frac{1}{n} \sum_{i=1}^n D_{\mathrm{kl}}\big(A(x_1^n) \,\|\, A_i(x_{\backslash i})\big) \le \varepsilon.
\]
Example 6.3.6 (KL-stability in mean estimation: Gaussian noise addition): Let X = [−1, 1] and consider the mechanism that releases the sample mean perturbed by Gaussian noise, $A(x_1^n) = \frac{1}{n} \sum_{i=1}^n x_i + Z$ for $Z \sim \mathsf{N}(0, \sigma^2)$. Taking $A_i(x_{\backslash i}) = \frac{1}{n} \sum_{j \neq i} x_j + Z$, the KL divergence between Gaussians with a common variance gives
\[
\frac{1}{n} \sum_{i=1}^n D_{\mathrm{kl}}\big(A(x_1^n) \,\|\, A_i(x_{\backslash i})\big)
= \frac{1}{2n\sigma^2} \sum_{i=1}^n \frac{x_i^2}{n^2} \le \frac{1}{2\sigma^2 n^2},
\]
so that the sample mean of a bounded random variable perturbed with Gaussian noise is $\varepsilon = \frac{1}{2\sigma^2 n^2}$-KL-stable. ◇
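A quick numerical check of this calculation, as a sketch: for Gaussians with a common variance, Dkl (N(µ0 , σ 2 )||N(µ1 , σ 2 )) = (µ0 − µ1 )2 /(2σ 2 ), and the leave-one-out means differ by exactly xi /n.

import numpy as np

def gaussian_kl(mu0, mu1, sigma):
    # D_kl(N(mu0, sigma^2) || N(mu1, sigma^2)) for a common variance
    return (mu0 - mu1) ** 2 / (2 * sigma ** 2)

rng = np.random.default_rng(3)
n, sigma = 50, 0.1
x = rng.uniform(-1, 1, size=n)        # a bounded sample, |x_i| <= 1
kls = [gaussian_kl(np.mean(x), np.sum(np.delete(x, i)) / n, sigma)
       for i in range(n)]
print(np.mean(kls), "<=", 1 / (2 * sigma ** 2 * n ** 2))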
Example 6.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions of Example 6.3.6 hold, but suppose that instead of Gaussian noise we add scaled Laplace noise, that is, $A(x_1^n) = \frac{1}{n} \sum_{i=1}^n x_i + Z$ for Z with density $p(z) = \frac{1}{2\sigma} \exp(-|z|/\sigma)$, where σ > 0. Then using that if Lµ,σ denotes the Laplace distribution with shape σ and mean µ, with density $p(z) = \frac{1}{2\sigma} \exp(-|z - \mu|/\sigma)$, we have
\[
D_{\mathrm{kl}}(L_{\mu_0,\sigma} \,\|\, L_{\mu_1,\sigma})
= \frac{1}{\sigma^2} \int_0^{|\mu_1 - \mu_0|} e^{-z/\sigma} (|\mu_1 - \mu_0| - z) \, dz
= \exp\Big(-\frac{|\mu_1 - \mu_0|}{\sigma}\Big) - 1 + \frac{|\mu_1 - \mu_0|}{\sigma}
\le \frac{|\mu_1 - \mu_0|^2}{2\sigma^2},
\]
we see that in this case the sample mean of a bounded random variable perturbed with Laplace noise is $\varepsilon = \frac{1}{2\sigma^2 n^2}$-KL-stable, where σ is the shape parameter. ◇
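As a numerical sanity check of the Laplace KL formula (a sketch, integrating p0 log(p0 /p1 ) on a fine grid):

import numpy as np

def laplace_kl_numeric(mu0, mu1, sigma):
    # Crude Riemann-sum integration of p0 * log(p0 / p1) over a wide grid.
    z = np.linspace(mu0 - 40 * sigma, mu0 + 40 * sigma, 800_001)
    p0 = np.exp(-np.abs(z - mu0) / sigma) / (2 * sigma)
    p1 = np.exp(-np.abs(z - mu1) / sigma) / (2 * sigma)
    return float(np.sum(p0 * np.log(p0 / p1)) * (z[1] - z[0]))

mu0, mu1, sigma = 0.0, 0.3, 1.0
delta = abs(mu1 - mu0)
closed_form = np.exp(-delta / sigma) - 1 + delta / sigma
print(laplace_kl_numeric(mu0, mu1, sigma))             # matches the closed form
print(closed_form, "<=", delta ** 2 / (2 * sigma ** 2))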
The two key facts are that KL-stable algorithms compose adaptively and that they bound the mutual information between the algorithm's output and an independent sample. We begin with adaptive composition.

Lemma 6.3.8. Let A : X n → A0 be ε0 -KL-stable, and let A′ : A0 × X n → A1 be such that A′ (a, ·) is ε1 -KL-stable for each a ∈ A0 . Then the composition x1n ↦ A′ (A(x1n ), x1n ) is (ε0 + ε1 )-KL-stable.
Proof Let Ai and A′i be the promised sub-algorithms in Definition 6.1. We apply the data processing inequality, which implies for each i that
\[
D_{\mathrm{kl}}\big(A'(A(x_1^n), x_1^n) \,\|\, A'_i(A_i(x_{\backslash i}), x_{\backslash i})\big)
\le D_{\mathrm{kl}}\big((A'(A(x_1^n), x_1^n), A(x_1^n)) \,\|\, (A'_i(A_i(x_{\backslash i}), x_{\backslash i}), A_i(x_{\backslash i}))\big).
\]
We require a bit of notational trickery now. Fixing i, let PA,A′ be the joint distribution of A′ (A(xn1 ), xn1 ) and A(xn1 ) and QA,A′ the joint distribution of A′i (Ai (x\i ), x\i ) and Ai (x\i ), so that both are distributions over A1 × A0 . Let PA′ |a be the distribution of A′ (a, xn1 ) and similarly let QA′ |a be the distribution of A′i (a, x\i ). Note that A′ and A′i both “observe” x, so that using the chain rule (2.1.6) for KL-divergences, we have
\[
D_{\mathrm{kl}}(P_{A,A'} \,\|\, Q_{A,A'})
= D_{\mathrm{kl}}\big(A(x_1^n) \,\|\, A_i(x_{\backslash i})\big)
+ \int D_{\mathrm{kl}}(P_{A'|a} \,\|\, Q_{A'|a}) \, dP_A(a);
\]
averaging over i = 1, . . . , n and applying Definition 6.1 to A and to each A′ (a, ·) bounds the average by ε0 + ε1 , as desired.
The second key result is a tensorization bound on the mutual information between a randomized algorithm's output and an independent sample.

Lemma 6.3.9. Let X1n be drawn from a product distribution and let A be a randomized algorithm. Then
\[
I(A; X_1^n) \le \sum_{i=1}^n I(A; X_i \mid X_{\backslash i}).
\]
Proof Without loss of generality, we assume A and X are both discrete. In this case, we have
\[
I(A; X_1^n) = \sum_{i=1}^n I(A; X_i \mid X_1^{i-1})
= \sum_{i=1}^n \Big[ H(X_i \mid X_1^{i-1}) - H(X_i \mid A, X_1^{i-1}) \Big].
\]
Now, because the Xi follow a product distribution, H(Xi | X1i−1 ) = H(Xi ), while H(Xi | A, X1i−1 ) ≥ H(Xi | A, X\i ) because conditioning reduces entropy. Consequently, we have
\[
I(A; X_1^n) \le \sum_{i=1}^n \Big[ H(X_i) - H(X_i \mid A, X_{\backslash i}) \Big]
= \sum_{i=1}^n I(A; X_i \mid X_{\backslash i}).
\]
Combining Lemmas 6.3.8 and 6.3.9, we see (nearly) immediately that KL-stability implies a mutual information bound, and consequently even interactive KL-stable algorithms maintain bounds on mutual information.

Proposition 6.3.10. Let X1n be drawn i.i.d. P , and let A1 , . . . , Ak be randomized algorithms, where each Aj may depend on the outputs of A1 , . . . , Aj−1 and, for each fixed value of those outputs, is ε-KL-stable. Then
\[
I(X_1^n; A_1, \ldots, A_k) \le nk\varepsilon.
\]

Proof Fix an index j and, for shorthand, let A = Aj and let A′ = (A1 , . . . , Aj−1 ) be the first j − 1 procedures. Then expanding the final mutual information term and letting ν denote the distribution of A′ , we have
\[
I(A; X_i \mid X_{\backslash i}, A')
= \int D_{\mathrm{kl}}\big(A(a', x_1^n) \,\|\, A(a', x_{\backslash i})\big) \, dP(x_i \mid A' = a', x_{\backslash i}) \, dP^{n-1}(x_{\backslash i}) \, d\nu(a' \mid x_{\backslash i}),
\]
where A(a′ , xn1 ) is the (random) procedure A on inputs xn1 and a′ , while A(a′ , x\i ) denotes the (random) procedure A on inputs a′ and (x\i , Xi ), where the ith example Xi follows its distribution conditional on A′ = a′ and X\i = x\i , as in Lemma 6.3.9. We then recognize that for each i, we have
\[
\int D_{\mathrm{kl}}\big(A(a', x_1^n) \,\|\, A(a', x_{\backslash i})\big) \, dP(x_i \mid a', x_{\backslash i})
\le \int D_{\mathrm{kl}}\big(A(a', x_1^n) \,\|\, \widetilde{A}(a', x_{\backslash i})\big) \, dP(x_i \mid a', x_{\backslash i})
\]
for any randomized function Ã, as the marginal A in the lemma minimizes the average KL-divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as in Lemma 6.3.8.
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {ϕt }t∈T of possible queries ϕt : X → [−1, 1], noise level σ 2 > 0
Repeat: for k = 1, 2, . . .
  (i) the analyst selects an index Tk ∈ T , which may depend on the answers A1 , . . . , Ak−1
  (ii) the mechanism draws Zk ∼ N(0, σ 2 ) independently and returns Ak = Pn ϕTk + Zk

Figure 6.2: Interactive statistical query answering with Gaussian noise.
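A minimal Python sketch of the mechanism of Figure 6.2 follows; the function and variable names are illustrative, not from the text.

import numpy as np

def noisy_sq_mechanism(sample, analyst_choose, sigma, num_rounds, rng):
    # Each answer is the empirical mean P_n phi_{T_k} plus fresh N(0, sigma^2)
    # noise, so by Example 6.3.6 each round is 1/(2 sigma^2 n^2)-KL-stable,
    # and by Proposition 6.3.10 the rounds compose adaptively.
    answers = []
    for _ in range(num_rounds):
        phi = analyst_choose(answers)      # T_k chosen from A_1, ..., A_{k-1}
        answers.append(float(np.mean(phi(sample)) + sigma * rng.normal()))
    return answers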
The procedure of Figure 6.2 is evidently KL-stable, and based on Example 6.3.6 and Proposition 6.3.10, we have that
\[
\frac{1}{n} I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{k}{2\sigma^2 n^2}
\]
so long as the indices Ti ∈ T are chosen only as functions of the answers Aj = Pn ϕTj + Zj for j < i, as the classical information processing inequality implies that
\[
\frac{1}{n} I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{1}{n} I(X_1^n; A_1, \ldots, A_k)
\]
because we have X1n → A1 → T2 and so on for the remaining indices. With this, we obtain the following theorem.
Theorem 6.3.11. Let the indices Ti , i = 1, . . . , k + 1, be chosen in an arbitrary way using the procedure of Figure 6.2, and let σ 2 > 0. Then
\[
\mathbb{E}\Big[\max_{j \le k} (A_j - P \phi_{T_j})^2\Big]
\le \frac{2ek}{\sigma^2 n^2} + \frac{10}{4n} + 4\sigma^2 (\log k + 1).
\]

By inspection, we can optimize over σ 2 by setting $\sigma^2 = \sqrt{k/(\log k + 1)}/n$, which balances the first and last terms: each then becomes a multiple of $\sqrt{k(1 + \log k)}/n$, and since $2e + 4 \le 10$, this yields the upper bound
\[
\mathbb{E}\Big[\max_{j \le k} (A_j - P \phi_{T_j})^2\Big]
\le \frac{10}{4n} + 10 \frac{\sqrt{k(1 + \log k)}}{n}.
\]
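A quick numerical check of this optimization, as a sketch with arbitrary values of k and n:

import numpy as np

def theorem_bound(sigma2, k, n):
    # The right-hand side of Theorem 6.3.11 as a function of sigma^2.
    return (2 * np.e * k / (sigma2 * n ** 2) + 10 / (4 * n)
            + 4 * sigma2 * (np.log(k) + 1))

k, n = 1000, 10_000
sigma2_star = np.sqrt(k / (np.log(k) + 1)) / n
print(theorem_bound(sigma2_star, k, n))           # (2e + 4) sqrt(k(1+log k))/n + 10/(4n)
print(10 / (4 * n) + 10 * np.sqrt(k * (1 + np.log(k))) / n)  # the stated bound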
Comparing to Example 6.3.4, we see a substantial improvement. While we do not achieve accuracy scaling with log k, as we would if the queried functionals ϕt were completely independent of the sample, we achieve mean-squared error of order
\[
\frac{\sqrt{k \log k}}{n}
\]
for k adaptively chosen queries.
Proof To prove the result, we use a technique sometimes called the monitor technique. Roughly, the idea is that we can choose the index Tk+1 in any way we desire as long as it is a function of the answers A1 , . . . , Ak and any other constants independent of the data. Thus, we may choose
\[
T_{k+1} := T_{k^\star} \quad \text{where} \quad k^\star = \operatorname*{argmax}_{j \le k} \, |A_j - P \phi_{T_j}|,
\]
Setting $t_0 = 2 \log(4k/\sqrt{2\pi})$ gives $\mathbb{E}[\max_j W_j^2] \le 2 \log k + \log \frac{4}{\sqrt{2\pi}} + 1$.