Lecture Notes
John Duchi
December 6, 2023
Contents
4 Concentration Inequalities
4.1 Basic tail inequalities
4.1.1 Sub-Gaussian random variables
4.1.2 Sub-exponential random variables
4.1.3 Orlicz norms
4.1.4 First applications of concentration: random projections
4.1.5 A second application of concentration: codebook generation
4.2 Martingale methods
4.2.1 Sub-Gaussian martingales and Azuma-Hoeffding inequalities
4.2.2 Examples and bounded differences
4.3 Uniformity and metric entropy
4.3.1 Symmetrization and uniform laws
4.3.2 Metric entropy, coverings, and packings
4.4 Generalization bounds
4.4.1 Finite and countable classes of functions
4.4.2 Large classes
4.4.3 Structural risk minimization and adaptivity
4.5 Technical proofs
4.5.1 Proof of Theorem 4.1.11
4.5.2 Proof of Theorem 4.1.15
4.5.3 Proof of Theorem 4.3.6
4.6 Bibliography
4.7 Exercises
8 Minimax lower bounds: the Le Cam, Fano, and Assouad methods
8.1 Basic framework and minimax risk
8.2 Preliminaries on methods for lower bounds
8.2.1 From estimation to testing
8.2.2 Inequalities between divergences and product distributions
8.2.3 Metric entropy and packing numbers
8.3 Le Cam's method
8.4 Fano's method
8.4.1 The classical (local) Fano method
8.4.2 A distance-based Fano method
8.5 Assouad's method
8.5.1 Well-separated problems
8.5.2 From estimation to multiple binary tests
8.5.3 Example applications of Assouad's method
8.6 Nonparametric regression: minimax upper and lower bounds
8.6.1 Kernel estimates of the function
8.6.2 Minimax lower bounds on estimation with Assouad's method
8.7 Global Fano Method
8.7.1 A mutual information bound based on metric entropy
V Appendices
Chapter 1
This set of lecture notes explores some of the (many) connections relating information theory,
statistics, computation, and learning. Signal processing, machine learning, and statistics all revolve
around extracting useful information from signals and data. In signal processing and information
theory, a central question is how to best design signals—and the channels over which they are
transmitted—to maximally communicate and store information, and to allow the most effective
decoding. In machine learning and statistics, by contrast, it is often the case that there is a
fixed data distribution that nature provides, and it is the learner’s or statistician’s goal to recover
information about this (unknown) distribution.
A central aspect of information theory is the discovery of fundamental results: results that
demonstrate that certain procedures are optimal. That is, information theoretic tools allow a
characterization of the attainable results in a variety of communication and statistical settings. As
we explore in these notes in the context of statistical, inferential, and machine learning tasks, this
allows us to develop procedures whose optimality we can certify—no better procedure is possible.
Such results are useful for a myriad of reasons; we would like to avoid making bad decisions or false
inferences, we may realize a task is impossible, and we can explicitly calculate the amount of data
necessary for solving different statistical problems.
In this context, we provide two main high-level examples, one for each of these tasks.
Example 1.1.1 (Source coding): The source coding, or data compression problem, is to
take information from a source, compress it, decompress it, and recover the original message.
Graphically, we have

    Source −→ Compressor −→ Decompressor −→ Receiver.
The question, then, is how to design a compressor (encoder) and decompressor (decoder) that
uses the fewest number of bits to describe a source (or a message) while preserving all the
information, in the sense that the receiver receives the correct message with high probability.
This fewest number of bits is then the information content of the source (signal). 3
Example 1.1.2: The channel coding, or data transmission problem, is the same as the source
coding problem of Example 1.1.1, except that between the compressor and decompressor is a
source of noise, a channel. In this case, the graphical representation is

    Source −→ Compressor −→ Channel −→ Decompressor −→ Receiver.
Here the question is how many bits may be sent per channel use so that the receiver can reconstruct the desired message with low probability of error.
Because the channel introduces noise, we require some redundancy, and information theory
studies the exact amount of redundancy and number of bits that must be sent to allow such
reconstruction. 3
Here, we estimate P̂—an empirical version of the distribution P that is easier to describe than
the original signal X1 , . . . , Xn , with the hope that we learn information about the generating
distribution P , or at least describe it efficiently.
In our analogy with channel coding, we make a connection with estimation and inference.
Roughly, the major problem in statistics we consider is as follows: there exists some unknown
function f on a space X that we wish to estimate, and we are able to observe a noisy version
of f (Xi ) for a series of Xi drawn from a distribution P . Recalling the graphical description of
Example 1.1.2, we now have a channel P (Y | f (X)) that gives us noisy observations of f (X) for
each Xi , but we may (generally) no longer choose the encoder/compressor. That is, we have
\[ \text{Source } (P) \xrightarrow{\;X_1,\ldots,X_n\;} \text{Compressor} \xrightarrow{\;f(X_1),\ldots,f(X_n)\;} \text{Channel } P(Y \mid f(X)) \xrightarrow{\;Y_1,\ldots,Y_n\;} \text{Decompressor}. \]
Example 1.2.1: A classical example of the statistical paradigm in this lens is the usual linear
regression problem. Here the data Xi belong to R^d, and the compression function f(x) = θ⊤x for some vector θ ∈ R^d. Then the channel is often of the form
\[ Y_i = \underbrace{\theta^\top X_i}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}}, \]
where \(\varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathsf{N}(0, \sigma^2)\) are independent mean-zero normal perturbations. The goal is, given a
sequence of pairs (Xi , Yi ), to recover the true θ in the linear model.
In active learning or active sensing scenarios, also known as (sequential) experimental design,
we may choose the sequence Xi so as to better explore properties of θ. Later in the course we
will investigate whether it is possible to improve estimation by these strategies. As one concrete
idea, if we allow infinite power, which in this context corresponds to letting ‖Xi‖ → ∞—choosing very “large” vectors xi—then the signal of θ⊤Xi should swamp any noise and make estimation easier. 3
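As a concrete illustration of this channel view, one can simulate the linear model numerically; the following Python sketch (with arbitrary choices of dimension, sample size, and noise level) draws pairs (Xi, Yi) from the channel and recovers θ by ordinary least squares.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 1000, 5, 1.0                        # illustrative sample size, dimension, noise level
    theta = rng.normal(size=d)                        # unknown parameter to recover
    X = rng.normal(size=(n, d))                       # source: X_i drawn i.i.d. from N(0, I)
    Y = X @ theta + sigma * rng.normal(size=n)        # channel: Y_i = theta^T X_i + eps_i
    theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # decode by least squares
    print(np.linalg.norm(theta_hat - theta))          # estimation error; shrinks as n grows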
For the remainder of the class, we explore these ideas in substantially more detail.
Part I of the notes covers what I term “stability” based results. At a high level, this means that
we ask what can be gained by considering situations where individual observations in a sequence
of random variables X1 , . . . , Xn have little effect on various functions of the sequence. We begin
in Chapter 4 with basic concentration inequalities, discussing how sums and related quantities can
converge quickly; while this material is essential for the remainder of the lectures, it does not depend
on particular information-theoretic techniques. We discuss some heuristic applications to problems
in statistical learning—empirical risk minimization—in this section of the notes. We provide a
treatment of more advanced ideas in Chapter 6, including some approaches to concentration via
entropy methods. We then turn in Chapter 5 carefully investigate generalization and convergence
guarantees—arguing that functions of a sample X1 , . . . , Xn are representative of the full population
P from which the sample is drawn—based on controlling different information-theoretic quantities.
In this context, we develop PAC-Bayesian bounds, and we also use the same framework to present
tools to control generalization and convergence in interactive data analyses. These types of analyses
reflect modern statistics, where one performs some type of data exploration before committing to a
fuller analysis, but which breaks classical statistical approaches, because the analysis now depends
on the sample. Finally, we provide a chapter (Chapter 7) on disclosure limitation and privacy
techniques, all of which repose on different notions of stability in distribution.
Part II studies fundamental limits, using information-theoretic techniques to derive lower bounds
on the possible rates of convergence for various estimation, learning, and other statistical problems.
Part III revisits all of our information theoretic notions from Chapter 2, but instead of sim-
ply giving definitions and a few consequences, provides operational interpretations of the different
information-theoretic quantities, such as entropy. Of course this includes Shannon’s original results
on the relationship between coding and entropy (Chapter 2.4.1), but we also provide an interpreta-
tion of entropy and information as measures of uncertainty in statistical experiments and statistical
learning, which is a perspective typically missing from information-theoretic treatments of entropy
(Chapters TBD). We also relate these ideas to game-playing and maximum likelihood estimation.
Finally, we relate generic divergence measures to questions of optimality and consistency in statisti-
cal and machine learning problems, which allows us to delineate when (at least in asymptotic senses)
it is possible to computationally efficiently learn good predictors and design good experiments.
Chapter 2
In this first introductory chapter, we discuss and review many of the basic concepts of information
theory in an effort to introduce them to readers unfamiliar with the tools. Our presentation is relatively
brisk, as our main goal is to get to the meat of the chapters on applications of the inequalities and
tools we develop, but these provide the starting point for everything in the sequel. One of the
main uses of information theory is to prove what, in an information theorist’s lexicon, are known
as converse results: fundamental limits that guarantee no procedure can improve over a particular
benchmark or baseline. We will give the first of these here to preview more of what is to come,
as these fundamental limits form one of the core connections between statistics and information
theory. The tools of information theory, in addition to their mathematical elegance, also come
with strong operational interpretations: they give quite precise answers and explanations for a
variety of real engineering and statistical phenomena. We will touch on one of these here (the
connection between source coding, or lossless compression, and the Shannon entropy), and much
of the remainder of the book will explore more.
2.1.1 Definitions
Here, we provide the basic definitions of entropy, information, and divergence, assuming the random
variables of interest are discrete or have densities with respect to Lebesgue measure.
Entropy: We begin with a central concept in information theory: the entropy. Let P be a distri-
bution on a finite (or countable) set X , and let p denote the probability mass function associated
with P . That is, if X is a random variable distributed according to P , then P (X = x) = p(x). The
entropy of X (or of P ) is defined as
\[ H(X) := -\sum_x p(x)\log p(x). \]
Because p(x) ≤ 1 for all x, it is clear that this quantity is nonnegative. We will show later that if X
is finite, the maximum entropy distribution on X is the uniform distribution, setting p(x) = 1/|X |
for all x, which has entropy log(|X |).
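For intuition, it is easy to evaluate the entropy of small distributions numerically; the sketch below (using natural logarithms and an arbitrary four-symbol alphabet) shows that the uniform distribution attains the maximum log |X| while a nearly deterministic one has entropy close to zero.

    import numpy as np

    def entropy(p):
        """Shannon entropy H(p) = -sum_x p(x) log p(x) in nats, with 0 log 0 := 0."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return float(-np.sum(p[nz] * np.log(p[nz])))

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # = log 4, about 1.386, the maximum on 4 symbols
    print(entropy([0.97, 0.01, 0.01, 0.01]))   # about 0.168, nearly deterministic
    print(np.log(4))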
Later in the class, we provide a number of operational interpretations of the entropy. The
most common interpretation—which forms the beginning of Shannon’s classical information the-
ory [158]—is via the source-coding theorem. We present Shannon’s source coding theorem in
Section 2.4.1, where we show that if we wish to encode a random variable X, distributed according
to P, with a k-ary string (i.e. each entry of the string takes on one of k values), then the minimal expected length of the encoding is given by \(H(X) = -\sum_x p(x)\log_k p(x)\). Moreover, this is achiev-
able (to within a length of at most 1 symbol) by using Huffman codes (among many other types of
codes). As an example of this interpretation, we may consider encoding a random variable X with
equi-probable distribution on m items, which has H(X) = log(m). In base-2, this makes sense: we
simply assign an integer to each item and encode each integer with the natural (binary) integer
encoding of length ⌈log m⌉.
We can also define the conditional entropy, which is the amount of information left in a random
variable after observing another. In particular, we define
\[ H(X \mid Y = y) = -\sum_x p(x \mid y)\log p(x \mid y) \quad\text{and}\quad H(X \mid Y) = \sum_y p(y) H(X \mid Y = y), \]
where p(x | y) denotes the conditional p.m.f. of X given Y = y.
Example 2.1.2 (Bernoulli random variables): Let h2 (p) = −p log p − (1 − p) log(1 − p) denote
the binary entropy, which is the entropy of a Bernoulli(p) random variable. 3
Example 2.1.4 (A random variable with infinite entropy): While most “reasonable” discrete
random variables have finite entropy, it is possible to construct distributions with infinite
entropy. Indeed, let X have p.m.f. on {2, 3, . . .} defined by
\[ p(k) = \frac{A}{k\log^2 k}, \quad\text{where}\quad A^{-1} = \sum_{k=2}^{\infty}\frac{1}{k\log^2 k} < \infty, \]
the last sum finite as \(\int_2^\infty \frac{1}{x\log^\alpha x}\,dx < \infty\) if and only if \(\alpha > 1\): for \(\alpha = 1\), we have \(\int_e^x \frac{1}{t\log t}\,dt = \log\log x\), while for \(\alpha > 1\), we have
\[ \frac{d}{dx}(\log x)^{1-\alpha} = (1-\alpha)\frac{1}{x\log^\alpha x}, \]
so that \(\int_e^\infty \frac{1}{t\log^\alpha t}\,dt = \frac{1}{\alpha-1}\). To see that the entropy is infinite, note that
\[ H(X) = A\sum_{k\ge 2}\frac{\log k + 2\log\log k - \log A}{k\log^2 k} \ge A\sum_{k\ge 2}\frac{\log k}{k\log^2 k} - C = A\sum_{k\ge 2}\frac{1}{k\log k} - C = \infty, \]
for a finite constant C.
KL-divergence: Now we define two additional quantities, which are actually much more funda-
mental than entropy: they can always be defined for any distributions and any random variables,
as they measure distance between distributions. Entropy simply makes no sense for non-discrete
random variables, let alone random variables with continuous and discrete components, though it
proves useful for some of our arguments and interpretations.
Before defining these quantities, we recall the definition of a convex function f : Rk → R as any
bowl-shaped function, that is, one satisfying
f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) (2.1.1)
for all λ ∈ [0, 1], all x, y. The function f is strictly convex if the convexity inequality (2.1.1) is
strict for λ ∈ (0, 1) and x ≠ y. We recall a standard result:
Proposition 2.1.5 (Jensen’s inequality). Let f be convex. Then for any random variable X,
f (E[X]) ≤ E[f (X)].
Moreover, if f is strictly convex, then f (E[X]) < E[f (X)] unless X is constant.
Now we may define and provide a few properties of the KL-divergence. Let P and Q be
distributions defined on a discrete set X . The KL-divergence between them is
\[ D_{\mathrm{kl}}(P\|Q) := \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}. \]
We observe immediately that Dkl (P ||Q) ≥ 0. To see this, we apply Jensen’s inequality (Propo-
sition 2.1.5) to the function − log and the random variable q(X)/p(X), where X is distributed
according to P :
\[ D_{\mathrm{kl}}(P\|Q) = -\mathbb{E}\left[\log\frac{q(X)}{p(X)}\right] \ge -\log\mathbb{E}\left[\frac{q(X)}{p(X)}\right] = -\log\sum_x p(x)\frac{q(x)}{p(x)} = -\log(1) = 0. \]
Moreover, as log is strictly convex, we have Dkl (P ||Q) > 0 unless P = Q. Another consequence of
the positivity of the KL-divergence is that whenever the set X is finite with cardinality |X | < ∞,
for any random variable X supported on X we have H(X) ≤ log |X|. Indeed, letting m = |X|, Q be the uniform distribution on X so that q(x) = 1/m, and X have distribution P on X, we have
\[ 0 \le D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -H(X) - \sum_x p(x)\log q(x) = -H(X) + \log m, \tag{2.1.2} \]
so that H(X) ≤ log m. Thus, the uniform distribution has the highest entropy over all distributions
on the set X .
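Both facts are easy to check numerically on small examples; the sketch below (with an arbitrary distribution on four symbols) computes the discrete KL-divergence, verifies identity (2.1.2) against the uniform distribution, and confirms H(X) ≤ log |X|.

    import numpy as np

    def kl(p, q):
        """Discrete KL-divergence sum_x p(x) log(p(x)/q(x)), in nats."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

    p = np.array([0.5, 0.2, 0.2, 0.1])
    q = np.ones(4) / 4                              # uniform distribution on 4 symbols
    H = -np.sum(p * np.log(p))                      # entropy of p
    print(kl(p, q) >= 0)                            # True: the KL-divergence is nonnegative
    print(np.isclose(kl(p, q), np.log(4) - H))      # identity (2.1.2): D(P||Unif) = log m - H(X)
    print(H <= np.log(4))                           # True: H(X) <= log |X|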
Mutual information: Having defined KL-divergence, we may now describe the information
content between two random variables X and Y . The mutual information I(X; Y ) between X and
Y is the KL-divergence between their joint distribution and their products (marginal) distributions.
More mathematically,
\[ I(X; Y) := \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}. \tag{2.1.3} \]
We can rewrite this in several ways. First, using Bayes’ rule, we have p(x, y)/p(y) = p(x | y), so
\[ I(X; Y) = \sum_{x,y} p(y)p(x\mid y)\log\frac{p(x\mid y)}{p(x)} = -\sum_x\sum_y p(y)p(x\mid y)\log p(x) + \sum_y p(y)\sum_x p(x\mid y)\log p(x\mid y) = H(X) - H(X\mid Y). \]
Similarly, we have I(X; Y ) = H(Y ) − H(Y | X), so mutual information can be thought of as the
amount of entropy removed (on average) in X by observing Y . We may also think of mutual infor-
mation as measuring the similarity between the joint distribution of X and Y and their distribution
when they are treated as independent.
Comparing the definition (2.1.3) to that for KL-divergence, we see that if PXY is the joint distribution of X and Y, while PX and PY are their marginal distributions (distributions when X and Y are treated independently), then
\[ I(X; Y) = D_{\mathrm{kl}}(P_{XY}\,\|\,P_X\times P_Y). \]
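As a quick sanity check, one can compute I(X; Y) directly from a small joint p.m.f. and verify the identity I(X; Y) = H(X) − H(X | Y); the 2 × 2 joint distribution below is an arbitrary choice.

    import numpy as np

    def mutual_information(pxy):
        """I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), in nats."""
        pxy = np.asarray(pxy, float)
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

    pxy = np.array([[0.4, 0.1],
                    [0.1, 0.4]])                    # a correlated joint p.m.f. on {0,1}^2
    px = pxy.sum(axis=1)
    H_X = -np.sum(px * np.log(px))
    H_X_given_Y = -np.sum(pxy * np.log(pxy / pxy.sum(axis=0, keepdims=True)))
    print(mutual_information(pxy))                  # about 0.193
    print(H_X - H_X_given_Y)                        # same value: I(X;Y) = H(X) - H(X|Y)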
Entropies of continuous random variables For continuous random variables, we may define
an analogue of the entropy known as differential entropy, which for a random variable X with
density p is defined by
\[ h(X) := -\int p(x)\log p(x)\,dx. \tag{2.1.4} \]
Note that the differential entropy may be negative—it is no longer directly a measure of the number
of bits required to describe a random variable X (on average), as was the case for the entropy. We
can similarly define the conditional entropy
\[ h(X \mid Y) = -\int p(y)\int p(x\mid y)\log p(x\mid y)\,dx\,dy. \]
We remark that the conditional differential entropy of X given Y for Y with arbitrary distribution—
so long as X has a density—is
\[ h(X \mid Y) = \mathbb{E}\left[-\int p(x\mid Y)\log p(x\mid Y)\,dx\right], \]
where p(x | y) denotes the conditional density of X when Y = y. The KL divergence between
distributions P and Q with densities p and q becomes
\[ D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx, \]
and similarly, we have the analogues of mutual information as
\[ I(X; Y) = \int p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\,dx\,dy = h(X) - h(X\mid Y) = h(Y) - h(Y\mid X). \]
As we show in the next subsection, we can define the KL-divergence between arbitrary distributions
(and mutual information between arbitrary random variables) more generally without requiring
discrete or continuous distributions. Before investigating these issues, however, we present a few
examples. We also see immediately that for X uniform on a set [a, b], we have h(X) = log(b − a).
Example 2.1.6 (Entropy of normal random variables): The differential entropy (2.1.4) of
a normal random variable is straightforward to compute. Indeed, for X ∼ N(µ, σ 2 ) we have
\(p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{1}{2\sigma^2}(x-\mu)^2)\), so that
\[ h(X) = -\int p(x)\left[\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(x-\mu)^2\right]dx = \frac{1}{2}\log(2\pi\sigma^2) + \frac{\mathbb{E}[(X-\mu)^2]}{2\sigma^2} = \frac{1}{2}\log(2\pi e\sigma^2). \]
For a general multivariate Gaussian, where X ∼ N(µ, Σ) for a vector µ ∈ R^n and Σ ≻ 0, with density \(p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}}\exp(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu))\), we similarly have
\[ h(X) = \frac{1}{2}\mathbb{E}\left[n\log(2\pi) + \log\det(\Sigma) + (X-\mu)^\top\Sigma^{-1}(X-\mu)\right] = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}\operatorname{tr}(\Sigma\Sigma^{-1}) = \frac{n}{2}\log(2\pi e) + \frac{1}{2}\log\det(\Sigma). \]
3
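The closed form is easy to verify by Monte Carlo, using h(X) = E[− log p(X)]; the one-dimensional sketch below uses arbitrary values of σ and the sample size.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0
    x = rng.normal(0.0, sigma, size=200_000)
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
    print(np.mean(-log_p))                            # Monte Carlo estimate of h(X) = E[-log p(X)]
    print(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # closed form (1/2) log(2 pi e sigma^2)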
Continuing our examples with normal distributions, we may compute the divergence between
two multivariate Gaussian distributions:
Example 2.1.7 (Divergence between Gaussian distributions): Let P be the multivariate
normal N(µ1 , Σ), and Q be the multivariate normal distribution with mean µ2 and identical
covariance Σ ≻ 0. Then we have that
\[ D_{\mathrm{kl}}(P\|Q) = \frac{1}{2}(\mu_1 - \mu_2)^\top\Sigma^{-1}(\mu_1 - \mu_2). \tag{2.1.5} \]
We leave the computation of the identity (2.1.5) to the reader. 3
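In one dimension, identity (2.1.5) reduces to (µ1 − µ2)²/(2σ²), which one can check against a Monte Carlo estimate of E_P[log p(X)/q(X)]; the parameter values below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu1, mu2, sigma = 1.0, 3.0, 1.5
    x = rng.normal(mu1, sigma, size=200_000)                    # X ~ P = N(mu1, sigma^2)
    log_ratio = ((x - mu2)**2 - (x - mu1)**2) / (2 * sigma**2)  # log p(x)/q(x)
    print(np.mean(log_ratio))                                   # Monte Carlo estimate of D_kl(P||Q)
    print((mu1 - mu2)**2 / (2 * sigma**2))                      # closed form from (2.1.5)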
An interesting consequence of Example 2.1.7 is that if a random vector X has a given covari-
ance Σ ∈ Rn×n , then the multivariate Gaussian with identical covariance has larger differential
entropy. Put another way, differential entropy for random variables with second moments is always
maximized by the Gaussian distribution.
Proposition 2.1.8. Let X be a random vector on Rn with a density, and assume that Cov(X) = Σ.
Then for Z ∼ N(0, Σ), we have
h(X) ≤ h(Z).
Proof Without loss of generality, we assume that X has mean 0. Let P be the distribution of
X with density p, and let Q be multivariate normal with mean 0 and covariance Σ; let Z be this
random variable. Then
\[ D_{\mathrm{kl}}(P\|Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx = -h(X) + \int p(x)\left[\frac{n}{2}\log(2\pi) + \frac{1}{2}\log\det(\Sigma) + \frac{1}{2}x^\top\Sigma^{-1}x\right]dx = -h(X) + h(Z), \]
because Z has the same covariance as X. As 0 ≤ Dkl (P ||Q), we have h(Z) ≥ h(X) as desired.
We remark in passing that the fact that Gaussian random variables have the largest entropy has
been used to prove stronger variants of the central limit theorem; see the original results of Barron
[16], as well as later quantitative results on the increase of entropy of normalized sums by Artstein
et al. [9] and Madiman and Barron [134].
Chain rules for information and divergence: As another immediate corollary to the chain
rule for entropy, we see that mutual information also obeys a chain rule:
\[ I(X; Y_1^n) = \sum_{i=1}^n I(X; Y_i \mid Y_1^{i-1}). \]
Indeed, we have
\[ I(X; Y_1^n) = H(Y_1^n) - H(Y_1^n\mid X) = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\right] = \sum_{i=1}^n I(X; Y_i\mid Y_1^{i-1}). \]
The KL-divergence obeys similar chain rules, making mutual information and KL-divergence mea-
sures useful tools for evaluation of distances and relationships between groups of random variables.
As a second example, suppose that the distribution P = P1 ×P2 ×· · ·×Pn , and Q = Q1 ×· · ·×Qn ,
that is, that P and Q are product distributions over independent random variables Xi ∼ Pi or
Xi ∼ Qi . Then we immediately have the tensorization identity
\[ D_{\mathrm{kl}}(P\|Q) = D_{\mathrm{kl}}(P_1\times\cdots\times P_n\,\|\,Q_1\times\cdots\times Q_n) = \sum_{i=1}^n D_{\mathrm{kl}}(P_i\|Q_i). \]
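The tensorization identity is easy to verify numerically for discrete product distributions; the sketch below builds products of two arbitrary marginals and compares the joint KL-divergence with the sum of coordinate-wise divergences.

    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

    p1, q1 = np.array([0.7, 0.3]), np.array([0.5, 0.5])
    p2, q2 = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
    P = np.outer(p1, p2).ravel()        # product distribution P1 x P2 on 6 outcomes
    Q = np.outer(q1, q2).ravel()        # product distribution Q1 x Q2
    print(kl(P, Q))                     # joint divergence
    print(kl(p1, q1) + kl(p2, q2))      # equals the sum of the marginal divergences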
We remark in passing that these two identities hold for arbitrary distributions Pi and Qi or random
variables X, Y. As a final tensorization identity, we consider a more general chain rule for KL-divergences, which will frequently be useful. We abuse notation temporarily, and for random variables X and Y with distributions P and Q, respectively, we denote \(D_{\mathrm{kl}}(X\|Y) := D_{\mathrm{kl}}(P\|Q)\).
In analogy to the entropy, we can also define the conditional KL divergence. Let X and Y have
distributions PX|z and PY |z conditioned on Z = z, respectively. Then we define
\[ D_{\mathrm{kl}}(X\|Y\mid Z) = \mathbb{E}_Z\left[D_{\mathrm{kl}}\left(P_{X|Z}\,\|\,P_{Y|Z}\right)\right], \]
so that if Z is discrete we have \(D_{\mathrm{kl}}(X\|Y\mid Z) = \sum_z p(z)D_{\mathrm{kl}}(P_{X|z}\|P_{Y|z})\). With this notation, we
have the chain rule
\[ D_{\mathrm{kl}}(X_1,\ldots,X_n\,\|\,Y_1,\ldots,Y_n) = \sum_{i=1}^n D_{\mathrm{kl}}\left(X_i\|Y_i\mid X_1^{i-1}\right), \tag{2.1.6} \]
because (in the discrete case, which—as we discuss presently—is fully general for this purpose) for
distributions PXY and QXY we have
\[ D_{\mathrm{kl}}(P_{XY}\|Q_{XY}) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{q(x,y)} = \sum_{x,y} p(x)p(y\mid x)\left[\log\frac{p(y\mid x)}{q(y\mid x)} + \log\frac{p(x)}{q(x)}\right] = \sum_x p(x)\log\frac{p(x)}{q(x)} + \sum_x p(x)\sum_y p(y\mid x)\log\frac{p(y\mid x)}{q(y\mid x)}, \]
where the final equality uses that \(\sum_y p(y\mid x) = 1\) for all x.
Expanding upon this, we give several tensorization identities, showing how to transform ques-
tions about the joint distribution of many random variables to simpler questions about their
marginals. As a first example, as a consequence of the fact that conditioning decreases entropy, we have for any sequence of (discrete or continuous, as appropriate) random variables that
\[ H(X_1,\ldots,X_n) \le \sum_{i=1}^n H(X_i) \quad\text{and}\quad h(X_1,\ldots,X_n) \le \sum_{i=1}^n h(X_i). \]
Both inequalities hold with equality if and only if X1, . . . , Xn are mutually independent. (The only
if follows because I(X; Y ) > 0 whenever X and Y are not independent, by Jensen’s inequality and
the fact that Dkl (P ||Q) > 0 unless P = Q.)
We return to information and divergence now. Suppose that random variables Yi are indepen-
dent conditional on X, meaning that
\[ P(Y_1 = y_1, \ldots, Y_n = y_n \mid X) = \prod_{i=1}^n P(Y_i = y_i \mid X). \]
Such scenarios are common—as we shall see—when we make multiple observations from a fixed
distribution parameterized by some X. Then we have the inequality
\[ I(X; Y_1,\ldots,Y_n) = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X, Y_1^{i-1})\right] = \sum_{i=1}^n\left[H(Y_i\mid Y_1^{i-1}) - H(Y_i\mid X)\right] \le \sum_{i=1}^n\left[H(Y_i) - H(Y_i\mid X)\right] = \sum_{i=1}^n I(X; Y_i), \tag{2.1.7} \]
where the second equality uses that the Yi are conditionally independent given X and the inequality that conditioning reduces entropy.
An important consequence is the data processing inequality for mutual information. Suppose that the random variables X, Y, and Z form the Markov chain
\[ X \to Y \to Z. \]
Proposition 2.1.9. With the above Markov chain, we have I(X; Z) ≤ I(X; Y ).
Proof By the chain rule for mutual information,
\[ I(X; Y, Z) = I(X; Z) + I(X; Y\mid Z) = I(X; Y) + I(X; Z\mid Y) = I(X; Y), \]
where we note that the final equality follows because X is independent of Z given Y:
\[ I(X; Z\mid Y) = H(Z\mid Y) - H(Z\mid Y, X) = 0. \]
As I(X; Y | Z) ≥ 0, we conclude that I(X; Z) ≤ I(X; Y).
There are related data processing inequalities for the KL-divergence—which we generalize in
the next section—as well. In this case, we may consider a simple Markov chain X → Z. If we
let P1 and P2 be distributions on X and Q1 and Q2 be the induced distributions on Z, that is, \(Q_i(A) = \int P(Z\in A\mid x)\,dP_i(x)\), then we have
\[ D_{\mathrm{kl}}(Q_1\|Q_2) \le D_{\mathrm{kl}}(P_1\|P_2), \]
the basic KL-divergence data processing inequality. A consequence of this is that, for any function
f and random variables X and Y on the same space, we have
Dkl (f (X)||f (Y )) ≤ Dkl (X||Y ) .
We explore these data processing inequalities more when we generalize KL-divergences in the next
section and in the exercises.
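The data processing inequality is easy to observe numerically: pushing two distributions through a fixed channel (here an arbitrary stochastic matrix mapping three states to two) can only decrease their KL-divergence.

    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

    P1 = np.array([0.6, 0.3, 0.1])
    P2 = np.array([0.2, 0.3, 0.5])
    K = np.array([[0.9, 0.1],            # K[x, z] = P(Z = z | X = x); rows sum to 1
                  [0.5, 0.5],
                  [0.2, 0.8]])
    Q1, Q2 = P1 @ K, P2 @ K              # induced distributions of Z under P1 and P2
    print(kl(Q1, Q2), "<=", kl(P1, P2))  # data processing: the divergence can only shrink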
qA (x) = i when x ∈ Ai .
2.2.2 KL-divergence
In this section, we present the general definition of a KL-divergence, which holds for any pair of
distributions. Let P and Q be distributions on a space X . Now, let A be a finite algebra on X
(as in the previous section, this is equivalent to picking a partition of X and then constructing the
associated algebra), and assume that its atoms are atoms(A). The KL-divergence between P and
Q conditioned on A is
\[ D_{\mathrm{kl}}(P\|Q \mid \mathcal{A}) := \sum_{A\in\mathrm{atoms}(\mathcal{A})} P(A)\log\frac{P(A)}{Q(A)}. \]
That is, we simply sum over the partition of X . Another way to write this is as follows. Let
q : X → {1, . . . , m} be a quantizer, and define the sets Ai = q−1 ({i}) to be the pre-images of each
i (i.e. the different quantization regions, or the partition of X that q induces). Then the quantized
KL-divergence between P and Q is
\[ D_{\mathrm{kl}}(P\|Q \mid \mathsf{q}) := \sum_{i=1}^m P(A_i)\log\frac{P(A_i)}{Q(A_i)}. \]
We may now give the fully general definition of KL-divergence: the KL-divergence between P
and Q is defined as
\[ D_{\mathrm{kl}}(P\|Q) := \sup_{\mathsf{q}} D_{\mathrm{kl}}(P\|Q\mid\mathsf{q}), \tag{2.2.1} \]
the supremum taken over all quantizers of X into finitely many sets.
This also gives a rigorous definition of mutual information. Indeed, if X and Y are random variables
with joint distribution PXY and marginal distributions PX and PY, we simply define
\[ I(X; Y) := D_{\mathrm{kl}}(P_{XY}\,\|\,P_X\times P_Y). \]
If P and Q both have probability mass functions p and q, then—as we see in Exercise 2.6—the definition (2.2.1) is equivalent to
\[ D_{\mathrm{kl}}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}. \]
Measure-theoretic definition of KL-divergence If you have never seen measure theory be-
fore, skim this section; while the notation may be somewhat intimidating, it is fine to always
consider only continuous or fully discrete distributions. We will describe an interpretation that will
mean for our purposes that one never needs to really think about measure theoretic issues.
The general definition (2.2.1) of KL-divergence is equivalent to the following. Let µ be a measure
on X , and assume that P and Q are absolutely continuous with respect to µ, with densities p and
q, respectively. (For example, take µ = P + Q.) Then
\[ D_{\mathrm{kl}}(P\|Q) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,d\mu(x). \tag{2.2.2} \]
The proof of this fact is somewhat involved, requiring the technology of Lebesgue integration. (See
Gray [94, Chapter 5].)
For those who have not seen measure theory, the interpretation of the equality (2.2.2) should be as follows. When integrating a function f(x), replace \(\int f(x)\,d\mu(x)\) with one of two pairs of symbols: one may simply think of dµ(x) as dx, so that we are performing standard integration \(\int f(x)\,dx\), or one should think of the integral operation \(\int f(x)\,d\mu(x)\) as summing the argument of the integral, so dµ(x) = 1 and \(\int f(x)\,d\mu(x) = \sum_x f(x)\). (This corresponds to µ being “counting measure” on X.)
2.2.3 f -divergences
A more general notion of divergence is the so-called f -divergence, or Ali-Silvey divergence [4, 54]
(see also the alternate interpretations in the article by Liese and Vajda [131]). Here, the definition
is as follows. Let P and Q be probability distributions on the set X , and let f : R+ → R be a
convex function satisfying f (1) = 0. If X is a discrete set, then the f -divergence between P and Q
is
\[ D_f(P\|Q) := \sum_x q(x)f\left(\frac{p(x)}{q(x)}\right). \]
More generally, for any set X and a quantizer q : X → {1, . . . , m}, letting Ai = q−1 ({i}) = {x ∈
X | q(x) = i} be the partition the quantizer induces, we can define the quantized divergence
\[ D_f(P\|Q\mid\mathsf{q}) = \sum_{i=1}^m Q(A_i)f\left(\frac{P(A_i)}{Q(A_i)}\right), \]
and the general definition of an f divergence is (in analogy with the definition (2.2.1) of general
KL divergences)
\[ D_f(P\|Q) := \sup_{\mathsf{q}} D_f(P\|Q\mid\mathsf{q}), \tag{2.2.3} \]
the supremum taken over all quantizers of X into finitely many sets.
The definition (2.2.3) shows that, any time we have computations involving f -divergences—such
as KL-divergence or mutual information—it is no loss of generality, when performing the compu-
tations, to assume that all distributions have finite discrete support. There is a measure-theoretic
version of the definition (2.2.3) which is frequently easier to use. Assume w.l.o.g. that P and Q are
absolutely continuous with respect to the base measure µ. The f divergence between P and Q is
then
\[ D_f(P\|Q) := \int_{\mathcal{X}} q(x)f\left(\frac{p(x)}{q(x)}\right)d\mu(x). \tag{2.2.4} \]
This definition, it turns out, is not quite as general as we would like—in particular, it is unclear
how we should define the integral for points x such that q(x) = 0. With that in mind, we recall
that the perspective transform (see Appendices B.1.1 and B.3.3) of a function f : R → R is defined
by pers(f )(t, u) = uf (t/u) if u > 0 and by +∞ if u ≤ 0. This function is convex in its arguments
(Proposition B.3.12). In fact, this is not quite enough for the fully correct definition. The closure of
a convex function f is cl f(x) = sup{ℓ(x) | ℓ ≤ f, ℓ linear}, the supremum over all linear functions that globally lower bound f. Then [104, Proposition IV.2.2.2] the closure of pers(f) is defined, for any t0 ∈ int dom f, by
\[ \operatorname{cl}\operatorname{pers}(f)(t, u) = \begin{cases} uf(t/u) & \text{if } u > 0 \\ \lim_{\alpha\downarrow 0}\alpha f(t_0 - t + t/\alpha) & \text{if } u = 0 \\ +\infty & \text{if } u < 0. \end{cases} \]
(The choice of t0 does not affect the definition.) Then the fully general formula expressing the
f-divergence is
\[ D_f(P\|Q) = \int_{\mathcal{X}} \operatorname{cl}\operatorname{pers}(f)(p(x), q(x))\,d\mu(x). \tag{2.2.5} \]
This is what we mean by equation (2.2.4), which we use without comment.
In the exercises, we explore several properties of f -divergences, including the quantized repre-
sentation (2.2.3), showing different data processing inequalities and orderings of quantizers based
on the fineness of their induced partitions. Broadly, f -divergences satisfy essentially the same prop-
erties as KL-divergence, such as data-processing inequalities, and they provide a generalization of
mutual information. We explore f -divergences from additional perspectives later—they are impor-
tant both for optimality in estimation and related to consistency and prediction problems, as we
discuss in Chapter 14.
Examples We give several examples of f -divergences here; in Section 8.2.2 we provide a few
examples of their uses as well as providing a few natural inequalities between them.
Example 2.2.1 (KL-divergence): By taking f (t) = t log t, which is convex and satisfies
f (1) = 0, we obtain Df (P ||Q) = Dkl (P ||Q). 3
Example 2.2.3 (Total variation distance): The total variation distance between probability
distributions P and Q defined on a set X is the maximum difference between probabilities they
assign on subsets of X:
\[ \|P - Q\|_{\mathrm{TV}} := \sup_{A\subset\mathcal{X}} |P(A) - Q(A)| = \sup_{A\subset\mathcal{X}}\left(P(A) - Q(A)\right), \]
where the second equality follows by considering complements P(A^c) = 1 − P(A). The total
variation distance, as we shall see later, is important for verifying the optimality of different
tests, and appears in the measurement of difficulty of solving hypothesis testing problems. With the choice f(t) = ½|t − 1|, we obtain the total variation distance, that is, ‖P − Q‖TV = Df(P||Q).
There are several alternative characterizations, which we provide as Lemma 2.2.4 next; it will
be useful in the sequel when we develop inequalities relating the divergences. 3
Lemma 2.2.4. Let P, Q be probability measures with densities p, q with respect to a base measure
µ and f(t) = ½|t − 1|. Then
\[ \|P - Q\|_{\mathrm{TV}} = D_f(P\|Q) = \frac{1}{2}\int|p(x) - q(x)|\,d\mu(x) = \int[p(x) - q(x)]_+\,d\mu(x) = \int[q(x) - p(x)]_+\,d\mu(x). \]
Example 2.2.5 (Hellinger distance): The Hellinger distance between probability distributions P and Q defined on a set X is generated by the function \(f(t) = (\sqrt{t} - 1)^2 = t - 2\sqrt{t} + 1\). The Hellinger distance is then
\[ d_{\mathrm{hel}}(P, Q)^2 := \frac{1}{2}\int\left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x). \tag{2.2.7} \]
The non-squared version dhel (P, Q) is indeed a distance between probability measures P and
Q. It is sometimes convenient to rewrite the Hellinger distance in terms of the affinity between
P and Q, as
\[ d_{\mathrm{hel}}(P, Q)^2 = \frac{1}{2}\int\left(p(x) + q(x) - 2\sqrt{p(x)q(x)}\right)d\mu(x) = 1 - \int\sqrt{p(x)q(x)}\,d\mu(x), \tag{2.2.8} \]
which makes clear that dhel (P, Q) ∈ [0, 1] is on roughly the same scale as the variation distance;
we will say more later. 3
Example 2.2.6 (χ2 divergence): The χ2 -divergence is generated by taking f (t) = (t − 1)2 ,
so that
\[ D_{\chi^2}(P\|Q) := \int\left(\frac{p(x)}{q(x)} - 1\right)^2 q(x)\,d\mu(x) = \int\frac{p(x)^2}{q(x)}\,d\mu(x) - 1, \tag{2.2.9} \]
where the equality is immediate because \(\int p\,d\mu = \int q\,d\mu = 1\). 3
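All of these examples fit the same template; the sketch below implements the discrete formula D_f(P‖Q) = Σ_x q(x) f(p(x)/q(x)) for strictly positive p.m.f.s and evaluates it with the generators above (for the Hellinger case, f(t) = ½(√t − 1)² matches the normalization in (2.2.7)).

    import numpy as np

    def f_divergence(p, q, f):
        """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) for strictly positive p.m.f.s p, q."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        return float(np.sum(q * f(p / q)))

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.25, 0.25, 0.5])
    print(f_divergence(p, q, lambda t: t * np.log(t)))                # KL-divergence
    print(f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1)))          # total variation distance
    print(f_divergence(p, q, lambda t: 0.5 * (np.sqrt(t) - 1) ** 2))  # squared Hellinger distance (2.2.7)
    print(f_divergence(p, q, lambda t: (t - 1) ** 2))                 # chi-squared divergence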
Proposition 2.2.7. For any distributions P and Q on a common space,
\[ d_{\mathrm{hel}}^2(P, Q) \le \|P - Q\|_{\mathrm{TV}} \le d_{\mathrm{hel}}(P, Q)\sqrt{2 - d_{\mathrm{hel}}^2(P, Q)}. \]
Proof As in Example 2.2.5, we have \(\int\sqrt{p(x)q(x)}\,d\mu(x) = 1 - d_{\mathrm{hel}}(P, Q)^2\), so this (along with the representation in Lemma 2.2.4 for variation distance) implies
\[ \|P - Q\|_{\mathrm{TV}} = \frac{1}{2}\int|p(x) - q(x)|\,d\mu(x) \le d_{\mathrm{hel}}(P, Q)\left(2 - d_{\mathrm{hel}}^2(P, Q)\right)^{1/2}. \]
For the lower bound on total variation, note that for any a, b ∈ R+, we have \(a + b - 2\sqrt{ab} \le |a - b|\) (check the cases a > b and a < b separately); thus
\[ d_{\mathrm{hel}}^2(P, Q) = \frac{1}{2}\int\left[p(x) + q(x) - 2\sqrt{p(x)q(x)}\right]d\mu(x) \le \frac{1}{2}\int|p(x) - q(x)|\,d\mu(x), \]
as desired.
Several important inequalities relate the variation distance to the KL-divergence. We state
two important inequalities in the next proposition, both of which are important enough to justify
their own names.
Proposition 2.2.8. The total variation distance satisfies the following relationships.
(a) (Pinsker's inequality) For any distributions P and Q,
\[ \|P - Q\|_{\mathrm{TV}}^2 \le \frac{1}{2}D_{\mathrm{kl}}(P\|Q). \tag{2.2.10} \]
(b) For any distributions P and Q,
\[ \|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - \exp(-D_{\mathrm{kl}}(P\|Q))} \le 1 - \frac{1}{2}\exp(-D_{\mathrm{kl}}(P\|Q)). \]
Proof Exercise 2.19 outlines one proof of Pinsker’s inequality using the data processing inequality
(Proposition 2.2.13). We present an alternative via the Cauchy-Schwarz inequality. Using the
definition (2.2.1) of the KL-divergence, we may assume without loss of generality that P and Q are
finitely supported, say with p.m.f.s p1, . . . , pm and q1, . . . , qm. Define the negative entropy function \(h(p) = \sum_{i=1}^m p_i\log p_i\). Then showing that \(D_{\mathrm{kl}}(P\|Q) \ge 2\|P - Q\|_{\mathrm{TV}}^2 = \frac{1}{2}\|p - q\|_1^2\) is equivalent to showing that
\[ h(p) \ge h(q) + \langle\nabla h(q), p - q\rangle + \frac{1}{2}\|p - q\|_1^2, \tag{2.2.11} \]
because by inspection \(h(p) - h(q) - \langle\nabla h(q), p - q\rangle = \sum_i p_i\log\frac{p_i}{q_i}\). We do this via a Taylor expansion:
we have
\[ \nabla h(p) = [\log p_i + 1]_{i=1}^m \quad\text{and}\quad \nabla^2 h(p) = \operatorname{diag}\left([1/p_i]_{i=1}^m\right). \]
By Taylor’s theorem, there is some p̃ = (1 − t)p + tq, where t ∈ [0, 1], such that
\[ h(p) = h(q) + \langle\nabla h(q), p - q\rangle + \frac{1}{2}\langle p - q, \nabla^2 h(\tilde p)(p - q)\rangle. \]
But looking at the final quadratic, we have for any vector v and any p ≥ 0 satisfying \(\sum_i p_i = 1\),
\[ \langle v, \nabla^2 h(p)v\rangle = \sum_{i=1}^m\frac{v_i^2}{p_i} = \|p\|_1\sum_{i=1}^m\frac{v_i^2}{p_i} \ge \left(\sum_{i=1}^m\sqrt{p_i}\,\frac{|v_i|}{\sqrt{p_i}}\right)^2 = \|v\|_1^2, \]
where the inequality follows from Cauchy-Schwarz applied to the vectors \([\sqrt{p_i}]_i\) and \([|v_i|/\sqrt{p_i}]_i\). Thus inequality (2.2.11) holds.
For the claim (b), we use Proposition 2.2.7. Let \(a = \int\sqrt{p(x)q(x)}\,d\mu(x)\) be a shorthand for the affinity, so that \(d_{\mathrm{hel}}^2(P, Q) = 1 - a\). Then Proposition 2.2.7 gives \(\|P - Q\|_{\mathrm{TV}} \le \sqrt{1 - a}\sqrt{1 + a} = \sqrt{1 - a^2}\). Now apply Jensen's inequality to the exponential: we have
\[ \int\sqrt{p(x)q(x)}\,d\mu(x) = \int\sqrt{\frac{q(x)}{p(x)}}\,p(x)\,d\mu(x) = \int\exp\left(\frac{1}{2}\log\frac{q(x)}{p(x)}\right)p(x)\,d\mu(x) \ge \exp\left(\frac{1}{2}\int p(x)\log\frac{q(x)}{p(x)}\,d\mu(x)\right) = \exp\left(-\frac{1}{2}D_{\mathrm{kl}}(P\|Q)\right). \]
In particular, \(\sqrt{1 - a^2} \le \sqrt{1 - \exp(-\frac{1}{2}D_{\mathrm{kl}}(P\|Q))^2}\), which is the first claim of part (b). For the second, note that \(\sqrt{1 - c} \le 1 - \frac{1}{2}c\) for c ∈ [0, 1] by concavity of the square root.
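Both parts of the proposition are easy to spot-check for discrete distributions; the sketch below evaluates each bound for an arbitrary pair of p.m.f.s.

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.2, 0.6])
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    print(tv**2, "<=", 0.5 * kl)                   # (a) Pinsker's inequality (2.2.10)
    print(tv, "<=", np.sqrt(1 - np.exp(-kl)))      # (b) first bound
    print(tv, "<=", 1 - 0.5 * np.exp(-kl))         # (b) second bound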
We also have the following bounds on the KL-divergence in terms of the χ2 -divergence.
Proposition 2.2.9. For any distributions P, Q,
\[ D_{\mathrm{kl}}(P\|Q) \le \log\left(1 + D_{\chi^2}(P\|Q)\right) \le D_{\chi^2}(P\|Q). \]
It is also possible to relate mutual information between distributions to f -divergences, and even
to bound the mutual information above and below by the Hellinger distance for certain problems. In
this case, we consider the following situation: let V ∈ {0, 1} uniformly at random, and conditional
on V = v, draw X ∼ Pv for some distribution Pv on a space X . Then we have that
\[ I(X; V) = \frac{1}{2}D_{\mathrm{kl}}\left(P_0\,\|\,\bar{P}\right) + \frac{1}{2}D_{\mathrm{kl}}\left(P_1\,\|\,\bar{P}\right), \]
where \(\bar{P} = \frac{1}{2}P_0 + \frac{1}{2}P_1\). The divergence measure on the right side of the preceding identity is a
special case of the Jensen-Shannon divergence, defined for λ ∈ [0, 1] by
\[ D_{\mathrm{js},\lambda}(P\|Q) := \lambda D_{\mathrm{kl}}\left(P\,\|\,\lambda P + (1-\lambda)Q\right) + (1-\lambda)D_{\mathrm{kl}}\left(Q\,\|\,\lambda P + (1-\lambda)Q\right), \]
which is a symmetrized and bounded variant of the typical KL-divergence (we use the shorthand \(D_{\mathrm{js}}(P\|Q) := D_{\mathrm{js},1/2}(P\|Q)\) for the symmetric case). As a consequence, we also have
\[ I(X; V) = \frac{1}{2}D_f(P_0\|P_1) + \frac{1}{2}D_f(P_1\|P_0), \]
where \(f(t) = -t\log\left(\frac{1}{2t} + \frac{1}{2}\right) = t\log\frac{2t}{t+1}\), so that the mutual information is a particular f-divergence.
This form—as we see in the later chapters—is frequently convenient because it gives an object
with similar tensorization properties to KL-divergence while enjoying the boundedness properties
of Hellinger and variation distances. The following proposition captures the latter properties.

Proposition 2.2.10. For any distributions P0 and P1 on a common space,
\[ \log 2\cdot d_{\mathrm{hel}}^2(P_0, P_1) \le D_{\mathrm{js}}(P_0\|P_1) \le \log 2\cdot\|P_0 - P_1\|_{\mathrm{TV}}, \quad\text{and}\quad D_{\mathrm{js}}(P_0\|P_1) \le 2\,d_{\mathrm{hel}}^2(P_0, P_1). \]
But of course the final integral is ‖P1 − P0‖TV, giving I(X; V) ≤ log 2 · ‖P0 − P1‖TV. Conversely, for the lower bound on Djs(P0‖P1), we use the upper bound \(h_2(p) \le 2\log 2\cdot\sqrt{p(1-p)}\) to obtain
\[ \frac{1}{\log 2}I(X; V) \ge 1 - \int(p_0 + p_1)\sqrt{\frac{p_0}{p_1 + p_0}\left(1 - \frac{p_0}{p_1 + p_0}\right)}\,d\mu = 1 - \int\sqrt{p_0 p_1}\,d\mu = \frac{1}{2}\int\left(\sqrt{p_0} - \sqrt{p_1}\right)^2 d\mu = d_{\mathrm{hel}}^2(P_0, P_1), \]
as desired.
The Hellinger-based upper bound is simpler: by Proposition 2.2.9, we have
\begin{align*} D_{\mathrm{js}}(P_0\|P_1) &= \frac{1}{2}D_{\mathrm{kl}}(P_0\|(P_0+P_1)/2) + \frac{1}{2}D_{\mathrm{kl}}(P_1\|(P_0+P_1)/2) \\ &\le \frac{1}{2}D_{\chi^2}(P_0\|(P_0+P_1)/2) + \frac{1}{2}D_{\chi^2}(P_1\|(P_0+P_1)/2) \\ &= \frac{1}{2}\int\frac{(p_0 - p_1)^2}{p_0 + p_1}\,d\mu = \frac{1}{2}\int\frac{(\sqrt{p_0} - \sqrt{p_1})^2(\sqrt{p_0} + \sqrt{p_1})^2}{p_0 + p_1}\,d\mu. \end{align*}
Now note that \((a + b)^2 \le 2a^2 + 2b^2\) for any a, b ∈ R, and so \((\sqrt{p_0} + \sqrt{p_1})^2 \le 2(p_0 + p_1)\), and thus the final integral has bound \(\int(\sqrt{p_0} - \sqrt{p_1})^2\,d\mu = 2d_{\mathrm{hel}}^2(P_0, P_1)\).
Proposition 2.2.11. Let f be convex with f(1) = 0, let P1, P2, Q1, Q2 be distributions on a common set, and let λ ∈ [0, 1]. Then
\[ D_f\left(\lambda P_1 + (1-\lambda)P_2\,\|\,\lambda Q_1 + (1-\lambda)Q_2\right) \le \lambda D_f(P_1\|Q_1) + (1-\lambda)D_f(P_2\|Q_2). \]
The proof of this proposition we leave as Exercise 2.11, which we treat as a consequence of the
more general “log-sum” like inequalities of Exercise 2.8. It is, however, an immediate consequence
of the fully specified definition (2.2.5) of an f -divergence, because pers(f ) is jointly convex. As an
immediate corollary, we see that the same result is true for KL-divergence as well.
Corollary 2.2.12. The KL-divergence Dkl (P ||Q) is jointly convex in its arguments P and Q.
We can also provide more general data processing inequalities for f -divergences, paralleling
those for the KL-divergence. In this case, we consider random variables X and Z on spaces X
and Z, respectively, and a Markov transition kernel K giving the Markov chain X → Z. That
is, K(· | x) is a probability distribution on Z for each x ∈ X , and conditioned on X = x, Z has
distribution K(· | x) so that K(A | x) = P(Z ∈ A | X = x). Certainly, this includes the situation
when Z = φ(X) for some function φ, and more generally when Z = φ(X, U ) for a function φ and
some additional randomness U. For a distribution P on X, we then define the marginals
\[ KP(A) := \int_{\mathcal{X}} K(A\mid x)\,dP(x). \]
Proposition 2.2.13. Let P and Q be distributions on X and let K be any Markov kernel. Then
\[ D_f(KP\,\|\,KQ) \le D_f(P\|Q). \]
Thus, further processing of random variables can only bring them “closer” in the space of distribu-
tions; downstream processing of signals cannot make them further apart as distributions.
Proposition 2.3.1. Let X be an arbitrary set. For any distributions P1 and P2 on X, we have
\[ \inf_{\Psi}\left\{P_1(\Psi \neq 1) + P_2(\Psi \neq 2)\right\} = 1 - \|P_1 - P_2\|_{\mathrm{TV}}, \]
where the infimum is taken over all tests Ψ : X → {1, 2}.
Proof Any test Ψ : X → {1, 2} has an acceptance region, call it A ⊂ X , where it outputs 1 and
a region Ac where it outputs 2.
Then
\[ \inf_{\Psi}\left\{P_1(\Psi \neq 1) + P_2(\Psi \neq 2)\right\} = \inf_{A\subset\mathcal{X}}\left\{1 - (P_1(A) - P_2(A))\right\} = 1 - \sup_{A\subset\mathcal{X}}\left(P_1(A) - P_2(A)\right) = 1 - \|P_1 - P_2\|_{\mathrm{TV}}. \]
In the two-hypothesis case, we also know that the optimal test, by the Neyman-Pearson lemma,
is a likelihood ratio test. That is, assuming that P1 and P2 have densities p1 and p2 , the optimal
test is of the form
\[ \Psi(X) = \begin{cases} 1 & \text{if } \frac{p_1(X)}{p_2(X)} \ge t, \\ 2 & \text{if } \frac{p_1(X)}{p_2(X)} < t, \end{cases} \]
for some threshold t ≥ 0. In the case that the prior probabilities on P1 and P2 are each 1/2, then
t = 1 is optimal.
We give one example application of Proposition 2.3.1 to the problem of testing a normal mean.
Example 2.3.2 (Testing a normal mean): Suppose we observe \(X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P\) for P = P1 or P = P2, where Pv is the normal distribution N(µv, σ²), and µ1 ≠ µ2. We would like to understand the sample size n necessary to guarantee that no test can have small error, that is, say, that
\[ \inf_{\Psi}\left\{P_1(\Psi(X_1,\ldots,X_n) \neq 1) + P_2(\Psi(X_1,\ldots,X_n) \neq 2)\right\} \ge \frac{1}{2}. \]
By Proposition 2.3.1, we have that
\[ \inf_{\Psi}\left\{P_1(\Psi \neq 1) + P_2(\Psi \neq 2)\right\} = 1 - \|P_1^n - P_2^n\|_{\mathrm{TV}}, \]
where \(P_v^n\) denotes the n-fold product of Pv, that is, the distribution of \(X_1,\ldots,X_n \stackrel{\mathrm{iid}}{\sim} P_v\).
The interaction between total variation distance and product distributions is somewhat subtle,
so it is often advisable to use a divergence measure more attuned to the i.i.d. nature of the sam-
pling scheme. Two such measures are the KL-divergence and Hellinger distance, both of which
we explore in the coming chapters. With that in mind, we apply Pinsker’s inequality (2.2.10)
to see that kP1n − P2n k2TV ≤ 21 Dkl (P1n ||P2n ) = n2 Dkl (P1 ||P2 ), which implies that
r r 1 √
n 1 n 1 2 n |µ1 − µ2 |
1− kP1n − P2n kTV ≥1− Dkl (P1 ||P2 ) 2 = 1 − 2
(µ1 − µ2 )2
=1− .
2 2 2σ 2 σ
In particular, if n ≤ σ²/(µ1 − µ2)², then we have our desired lower bound of 1/2.
Conversely, a calculation yields that n ≥ Cσ²/(µ1 − µ2)², for some numerical constant C ≥ 1, implies small probability of error. We leave this calculation to the reader. 3
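A short simulation illustrates both directions: for n at or below σ²/(µ1 − µ2)² the natural test based on the sample mean errs often, while for much larger n its error essentially vanishes; the parameter values and number of trials below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    mu1, mu2, sigma, trials = 0.0, 0.5, 1.0, 5000
    for n in [1, 4, 100]:                              # sigma^2 / (mu1 - mu2)^2 = 4 here
        err = 0.0
        for mu_true, wrong in [(mu1, 2), (mu2, 1)]:
            X = rng.normal(mu_true, sigma, size=(trials, n))
            xbar = X.mean(axis=1)
            guess = np.where(np.abs(xbar - mu1) <= np.abs(xbar - mu2), 1, 2)
            err += np.mean(guess == wrong)             # P_v(test errs) under P_v
        print(n, err, ">=", max(0.0, 1 - np.sqrt(n) * abs(mu1 - mu2) / (2 * sigma)))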
and we wish to provide lower bounds on the probability of error—that is, that X̂ ≠ X. If we let
the function h2 (p) = −p log p − (1 − p) log(1 − p) denote the binary entropy (entropy of a Bernoulli
random variable with parameter p), Fano’s inequality takes the following form [e.g. 53, Chapter 2]:
Proposition 2.3.3 (Fano inequality). For any Markov chain \(X \to Y \to \widehat{X}\), we have
\[ h_2(P(\widehat{X} \neq X)) + P(\widehat{X} \neq X)\log(|\mathcal{X}| - 1) \ge H(X\mid\widehat{X}). \tag{2.3.1} \]
Proof This proof follows by expanding an entropy functional in two different ways. Let E be
the indicator for the event that X̂ ≠ X, that is, E = 1 if X̂ ≠ X and is 0 otherwise. Then we have
\begin{align*} H(X, E\mid\widehat{X}) &= H(X\mid E, \widehat{X}) + H(E\mid\widehat{X}) \\ &= P(E = 1)H(X\mid E = 1, \widehat{X}) + P(E = 0)\underbrace{H(X\mid E = 0, \widehat{X})}_{=0} + H(E\mid\widehat{X}), \end{align*}
where the zero follows because given there is no error, X has no variability given X̂. Expanding the entropy by the chain rule in a different order, we have
\[ H(X, E\mid\widehat{X}) = H(X\mid\widehat{X}) + \underbrace{H(E\mid\widehat{X}, X)}_{=0}, \]
the zero following because E is a function of X and X̂. Combining the two expansions gives
\[ H(X\mid\widehat{X}) = H(X, E\mid\widehat{X}) = P(E = 1)H(X\mid E = 1, \widehat{X}) + H(E\mid\widehat{X}). \]
Noting that \(H(E\mid\widehat{X}) \le H(E) = h_2(P(E = 1))\), as conditioning reduces entropy, and that \(H(X\mid E = 1, \widehat{X}) \le \log(|\mathcal{X}| - 1)\), as X can take on at most |X| − 1 values when there is an error, completes the proof.
Corollary 2.3.4. Let X be uniform on the finite set X and assume the Markov chain \(X \to Y \to \widehat{X}\). Then
\[ P(\widehat{X} \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log(|\mathcal{X}|)}. \tag{2.3.2} \]
Proof Let \(P_{\mathrm{error}} = P(\widehat{X} \neq X)\) denote the probability of error. Noting that h2(p) ≤ log 2 for any p ∈ [0, 1] (recall inequality (2.1.2), that is, that uniform random variables maximize entropy), then using Proposition 8.4.1, we have
\[ \log 2 + P_{\mathrm{error}}\log(|\mathcal{X}|) \ge h_2(P_{\mathrm{error}}) + P_{\mathrm{error}}\log(|\mathcal{X}| - 1) \stackrel{(i)}{\ge} H(X\mid\widehat{X}) \stackrel{(ii)}{=} H(X) - I(X; \widehat{X}). \]
Here step (i) uses Proposition 2.3.3 and step (ii) uses the definition of mutual information, that \(I(X; \widehat{X}) = H(X) - H(X\mid\widehat{X})\). The data processing inequality implies that \(I(X; \widehat{X}) \le I(X; Y)\), and using H(X) = log(|X|) completes the proof.
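As a numerical sanity check of Corollary 2.3.4, the sketch below takes X uniform on m symbols, passes it through a symmetric noisy channel (an arbitrary choice), computes I(X; Y), and compares the error of the optimal guess with the Fano lower bound (2.3.2).

    import numpy as np

    m, delta = 8, 0.3                                    # alphabet size and channel error probability
    pxy = np.full((m, m), delta / ((m - 1) * m))
    np.fill_diagonal(pxy, (1 - delta) / m)               # joint p.m.f. of (X, Y) with X uniform
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    I = np.sum(pxy * np.log(pxy / (px * py)))
    p_err = 1 - np.sum(np.max(pxy, axis=0))              # error of the optimal (MAP) guess of X from Y
    print(p_err, ">=", 1 - (I + np.log(2)) / np.log(m))  # Fano bound (2.3.2)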
In particular, Corollary 2.3.4 shows that when X is chosen uniformly at random and we observe
Y, we have
\[ \inf_{\Psi} P(\Psi(Y) \neq X) \ge 1 - \frac{I(X; Y) + \log 2}{\log|\mathcal{X}|}, \]
where the infimum is taken over all testing procedures Ψ. Some interpretation of this quantity
is helpful. If we think roughly of the number of bits it takes to describe a variable X uniformly
chosen from X , then we expect that log2 |X | bits are necessary (and sufficient). Thus, until we
collect enough information that I(X; Y) ≈ log |X|, so that I(X; Y)/log |X| ≈ 1, we are unlikely to be able to identify the variable X with any substantial probability. So we must collect enough
bits to actually discover X.
Example 2.3.5 (20 questions game): In the 20 questions game—a standard children’s game—
there are two players, the “chooser” and the “guesser,” and an agreed upon universe X . The
chooser picks an element x ∈ X , and the guesser’s goal is to find x by using a series of yes/no
questions about x. We consider optimal strategies for each player in this game, assuming that
X is finite and letting m = |X | be the universe size for shorthand.
For the guesser, it is clear that at most ⌈log₂ m⌉ questions are necessary to guess the item
X that the chooser has picked—at each round of the game, the guesser asks a question that
eliminates half of the remaining possible items. Indeed, let us assume that \(m = 2^l\) for some l ∈ N; if not, the guesser can always make her task more difficult by increasing the size of X until it is a power of 2. Thus, after k rounds, there are \(m2^{-k}\) items left, and we have
\[ m\left(\frac{1}{2}\right)^k \le 1 \quad\text{if and only if}\quad k \ge \log_2 m. \]
For the converse—the chooser’s strategy—let Y1 , Y2 , . . . , Yk be the sequence of yes/no answers
given to the guesser. Assume that the chooser picks X uniformly at random in X . Then Fano’s
inequality (2.3.2) implies that for the guess X̂ the guesser makes,
\[ P(\widehat{X} \neq X) \ge 1 - \frac{I(X; Y_1,\ldots,Y_k) + \log 2}{\log m}. \]
By the chain rule for mutual information, we have
\[ I(X; Y_1,\ldots,Y_k) = \sum_{i=1}^k I(X; Y_i\mid Y_{1:i-1}) = \sum_{i=1}^k\left[H(Y_i\mid Y_{1:i-1}) - H(Y_i\mid Y_{1:i-1}, X)\right] \le \sum_{i=1}^k H(Y_i). \]
As the answers Yi are yes/no, we have H(Yi ) ≤ log 2, so that I(X; Y1:k ) ≤ k log 2. Thus we
find
\[ P(\widehat{X} \neq X) \ge 1 - \frac{(k+1)\log 2}{\log m} = \frac{\log_2 m - 1 - k}{\log_2 m}, \]
so that the guesser must have k ≥ log₂(m/2) to be guaranteed that she will make no mistakes. 3
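The guesser's halving strategy is just binary search; the sketch below plays the game on a universe of size m = 1024 with yes/no questions of the form "is the item in the lower half of the remaining range?", using exactly ⌈log₂ m⌉ questions.

    import math

    def play_twenty_questions(m, x):
        """Guess x in {0, ..., m-1} with yes/no halving questions; returns (guess, #questions)."""
        lo, hi, questions = 0, m - 1, 0
        while lo < hi:
            mid = (lo + hi) // 2
            questions += 1
            if x <= mid:          # chooser answers "yes": the item lies in the lower half
                hi = mid
            else:                 # "no": the item lies in the upper half
                lo = mid + 1
        return lo, questions

    m = 1024
    guess, q = play_twenty_questions(m, x=777)
    print(guess, q, math.ceil(math.log2(m)))   # finds 777 in 10 = log2(1024) questions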
Figure 2.1. Prefix-tree encoding of a set of symbols. The encoding for x1 is 0, for x2 is 10, for x3 is 11, for x4 is 12, for x5 is 20, for x6 is 21, and nothing is encoded as 1, 2, or 22.
Theorem 2.4.2. Let X be a finite or countable set, and let ℓ : X → N be a function. If ℓ(x) is the length of the encoding of the symbol x in a uniquely decodable d-ary code, then
\[ \sum_{x\in\mathcal{X}} d^{-\ell(x)} \le 1. \tag{2.4.1} \]
Conversely, given any function ℓ : X → N satisfying inequality (2.4.1), there is a prefix code whose codewords have length ℓ(x) for each x ∈ X.
Proof We prove the first statement of the theorem first by a counting and asymptotic argument.
We begin by assuming that X is finite; we eliminate this assumption subsequently. As a
consequence, there is some maximum length ℓmax such that ℓ(x) ≤ ℓmax for all x ∈ X. For a sequence x1, . . . , xn ∈ X, we have by the definition of our encoding strategy that \(\ell(x_1,\ldots,x_n) = \sum_{i=1}^n\ell(x_i)\). In addition, for each m we let
\[ E_n(m) := \{x_{1:n}\in\mathcal{X}^n \text{ such that } \ell(x_{1:n}) = m\} \]
denote the symbols encoded with codewords of length m in our code; then as the code is uniquely decodable we certainly have \(\operatorname{card}(E_n(m)) \le d^m\) for all n and m. Moreover, for all \(x_{1:n}\in\mathcal{X}^n\) we have \(\ell(x_{1:n}) \le n\ell_{\max}\). We thus re-index the sum \(\sum_x d^{-\ell(x)}\) and compute
\[ \left(\sum_{x\in\mathcal{X}} d^{-\ell(x)}\right)^n = \sum_{x_{1:n}\in\mathcal{X}^n} d^{-\ell(x_{1:n})} = \sum_{m=1}^{n\ell_{\max}}\operatorname{card}(E_n(m))\,d^{-m} \le \sum_{m=1}^{n\ell_{\max}} d^m d^{-m} = n\ell_{\max}. \]
Taking n-th roots and letting n → ∞ then gives \(\sum_{x\in\mathcal{X}} d^{-\ell(x)} \le (n\ell_{\max})^{1/n} \to 1\), proving inequality (2.4.1) when X is finite. When X is countable, define \(D_k := \sum_{x:\ell(x)\le k} d^{-\ell(x)}\);
as each subset {x ∈ X : ℓ(x) ≤ k} is uniquely decodable, we have Dk ≤ 1 for all k. Then \(1 \ge \lim_{k\to\infty} D_k = \sum_{x\in\mathcal{X}} d^{-\ell(x)}\).
The achievability of such a code is straightforward by a pictorial argument (recall Figure 2.1),
so we sketch the result non-rigorously. Indeed, let Td be an (infinite) d-ary tree. Then, at each
level m of the tree, assign one of the nodes at that level to each symbol x ∈ X such that ℓ(x) = m.
Eliminate the subtree below that node, and repeat with the remaining symbols. The codeword
corresponding to symbol x is then the path to the symbol in the tree.
JCD Comment: Fill out this proof, potentially deferring it.
With the Kraft-McMillan theorem in place, we may directly relate the entropy of a random variable to the length of possible encodings for the variable; in particular, we show that the entropy is essentially the best possible code length of a uniquely decodable source code. In this theorem, we use the shorthand
\[ H_d(X) := -\sum_{x\in\mathcal{X}} p(x)\log_d p(x). \]
Theorem 2.4.3. Let X ∈ X be a discrete random variable distributed according to P and let ℓC be the length function associated with a d-ary encoding C : X → {0, . . . , d − 1}*. In addition, let \(\mathcal{C}\) be the set of all uniquely decodable d-ary codes for X. Then
\[ H_d(X) \le \inf_{C\in\mathcal{C}} \mathbb{E}_P[\ell_C(X)] \le H_d(X) + 1. \]
Proof The lower bound is an argument by convex optimization, while for the upper bound
we give an explicit length function and (implicit) prefix code attaining the bound. For the lower
bound, we assume for simplicity that X is finite, and we identify X = {1, . . . , |X |} (let m = |X | for
shorthand). Then as C consists of uniquely decodable codebooks, all the associated length functions
must satisfy the Kraft-McMillan inequality (2.4.1). Letting ℓi = ℓ(i), the minimal encoding length is at least
\[ \inf_{\ell\in\mathbb{R}^m}\left\{\sum_{i=1}^m p_i\ell_i \;:\; \sum_{i=1}^m d^{-\ell_i} \le 1\right\}. \]
By introducing the Lagrange multiplier λ ≥ 0 for the inequality constraint, we may write the
Lagrangian for the preceding minimization problem as
\[ L(\ell, \lambda) = p^\top\ell + \lambda\left(\sum_{i=1}^m d^{-\ell_i} - 1\right) \quad\text{with}\quad \nabla_\ell L(\ell, \lambda) = p - \lambda\left[d^{-\ell_i}\log d\right]_{i=1}^m. \]
In particular, the optimal ℓ satisfies \(\ell_i = \log_d\frac{\theta}{p_i}\) for some constant θ, and solving \(\sum_{i=1}^m d^{-\log_d\frac{\theta}{p_i}} = 1\) gives θ = 1 and \(\ell(i) = \log_d\frac{1}{p_i}\).
To attain the result, simply set our encoding to be \(\ell(x) = \lceil\log_d\frac{1}{P(X=x)}\rceil\), which satisfies the Kraft-McMillan inequality and thus yields a valid prefix code with
\[ \mathbb{E}_P[\ell(X)] = \sum_{x\in\mathcal{X}} p(x)\left\lceil\log_d\frac{1}{p(x)}\right\rceil \le -\sum_{x\in\mathcal{X}} p(x)\log_d p(x) + 1 = H_d(X) + 1, \]
as desired.
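The construction in the proof is easy to experiment with; the sketch below computes the lengths ⌈log_d 1/p(x)⌉ for an arbitrary source distribution (with d = 2), checks the Kraft-McMillan inequality (2.4.1), and compares the expected length with H_d(X) and H_d(X) + 1.

    import math

    d = 2
    p = {"a": 0.5, "b": 0.25, "c": 0.125, "e": 0.125}             # an arbitrary source distribution
    lengths = {x: math.ceil(math.log(1 / px, d)) for x, px in p.items()}
    kraft = sum(d ** -l for l in lengths.values())
    H_d = -sum(px * math.log(px, d) for px in p.values())
    expected_len = sum(p[x] * lengths[x] for x in p)
    print(kraft <= 1)                                             # Kraft-McMillan inequality (2.4.1)
    print(H_d, "<=", expected_len, "<=", H_d + 1)                 # bounds of Theorem 2.4.3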
Theorem 2.4.3 thus shows that, at least to within an additive constant of 1, the entropy both
upper and lower bounds the expected length of a uniquely decodable code for the random variable
X. This is the first of our promised “operational interpretations” of the entropy.
The entropy rate of a sequence X1, X2, . . . of random variables taking values in the discrete set X is
\[ H(\{X_i\}) := \lim_{n\to\infty}\frac{1}{n}H(X_1,\ldots,X_n). \tag{2.4.2} \]
In some situations, the limit (2.4.2) may not exist. However, there are a variety of situations in
which it does, and we focus generally on a specific but common instance in which the limit does
exist. First, we recall the definition of a stationary sequence of random variables.
Definition 2.5. We say a sequence X1, X2, . . . of random variables is stationary if for all n and all k ∈ N and all measurable sets A1, . . . , Ak ⊂ X we have
\[ P(X_{1+n}\in A_1,\ldots,X_{k+n}\in A_k) = P(X_1\in A_1,\ldots,X_k\in A_k). \]
Proposition 2.4.4. Let the sequence of random variables {Xi }, taking values in the discrete space
X , be stationary. Then
\[ H(\{X_i\}) = \lim_{n\to\infty} H(X_n\mid X_1,\ldots,X_{n-1}). \]
Finally, we present a result showing that it is possible to achieve average code length of at most
the entropy rate, which for stationary sequences is smaller than the entropy of any single random
variable Xi . To do so, we require the use of a block code, which (while it may be prefix code) treats
sets of random variables (X1 , . . . , Xm ) ∈ X m as a single symbol to be jointly encoded.
Proposition 2.4.5. Let the sequence of random variables X1, X2, . . . be stationary. Then for any ε > 0, there exists an m ∈ N and a d-ary (prefix) block encoder C : X^m → {0, . . . , d − 1}* such that
\[ \lim_n \frac{1}{n}\mathbb{E}_P[\ell_C(X_{1:n})] \le H(\{X_i\}) + \varepsilon = \lim_n H(X_n\mid X_1,\ldots,X_{n-1}) + \varepsilon. \]
Taking n → ∞ yields that the term N(c_N − a)/n → 0, which gives that c_n − a ∈ [−ε, ε] eventually for any ε > 0, which is our desired result.
Note that if m does not divide n, we may also encode the length of the sequence of encoded words in each block of length m; in particular, if the block begins with a 0, it encodes m symbols, while if it begins with a 1, then the next ⌈log_d m⌉ bits encode the length of the block. This would yield an increase in the expected length of the code to
\[ \mathbb{E}_P[\ell_C(X_{1:n})] \le \frac{2n + \lceil\log_2 m\rceil}{m} + \frac{n}{m}H(X_1,\ldots,X_m). \]
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
Dividing by n and letting n → ∞ gives the result, as we can always choose m large.
2.5 Bibliography
The material in this chapter is classical in information theory. For all of our treatment of mutual
information, entropy, and KL-divergence in the discrete case, Cover and Thomas provide an es-
sentially complete treatment in Chapter 2 of their book [53]. Gray [94] provides a more advanced
(measure-theoretic) version of these results, with Chapter 5 covering most of our results (or Chap-
ter 7 in the newer addition of the same book). Csiszár and Körner [55] is the classic reference for
coding theorems and results on communication, including stronger converse results.
The f -divergence was independently discovered by Ali and Silvey [4] and Csiszár [54], and is
consequently sometimes called an Ali-Silvey divergence or Csiszár divergence. Liese and Vajda [131]
provide a survey of f -divergences and their relationships with different statistical concepts (taking
a Bayesian point of view), and various authors have extended the pairwise divergence measures to
divergence measures between multiple distributions [98], making connections to experimental design
and classification [89, 70], which we investigate later in the book. The inequalities relating divergences
in Section 2.2.4 are now classical, and standard references present them [127, 167]. For a proof that
equality (2.2.4) is equivalent to the definition (2.2.3) with the appropriate closure operations, see
the paper [70, Proposition 1]. We borrow the proof of the upper bound in Proposition 2.2.10 from
the paper [132].
2.6 Exercises
Our first few questions investigate properties of a divergence between distributions that is weaker
than the KL-divergence, but is intimately related to optimal testing. Let P1 and P2 be arbitrary
distributions on a space X . The total variation distance between P1 and P2 is defined as
‖P1 − P2‖_TV := sup_{A⊂X} |P1(A) − P2(A)|.
Exercise 2.1: Prove the following identities about total variation. Throughout, let P1 and P2
have densities p1 and p2 on a (common) set X .
(a) 2 ‖P1 − P2‖_TV = ∫ |p1(x) − p2(x)| dx.
(b) For functions f : X → R, define the supremum norm ‖f‖_∞ = sup_{x∈X} |f(x)|. Show that 2 ‖P1 − P2‖_TV = sup_{‖f‖_∞ ≤ 1} ∫_X f(x)(p1(x) − p2(x)) dx.
(c) ‖P1 − P2‖_TV = ∫ max{p1(x), p2(x)} dx − 1.
(d) ‖P1 − P2‖_TV = 1 − ∫ min{p1(x), p2(x)} dx.
Exercise 2.2 (Divergence between multivariate normal distributions): Let P1 be N(θ1, Σ) and P2 be N(θ2, Σ), where Σ ≻ 0 is a positive definite matrix. What is Dkl(P1||P2)?
Exercise 2.3 (The optimal test between distributions): Prove Le Cam's inequality: for any function ψ with dom ψ ⊃ X and any distributions P1, P2,
P1(ψ(X) ≠ 1) + P2(ψ(X) ≠ 2) ≥ 1 − ‖P1 − P2‖_TV.
Thus, the sum of the probabilities of error in a hypothesis testing problem, where based on a sample X we must decide whether P1 or P2 is more likely, has value at least 1 − ‖P1 − P2‖_TV. Given P1 and P2, is this risk attainable?
Exercise 2.4: A random variable X has Laplace(λ, µ) distribution if it has density p(x) = (λ/2) exp(−λ|x − µ|). Consider the hypothesis test of P1 versus P2, where X has distribution Laplace(λ, µ1) under P1 and distribution Laplace(λ, µ2) under P2, where µ1 < µ2. Show that the minimal value over all tests ψ of P1 versus P2 is
inf_ψ { P1(ψ(X) ≠ 1) + P2(ψ(X) ≠ 2) } = exp( −(λ/2)|µ1 − µ2| ).
Exercise 2.7 (f -divergences generalize standard divergences): Show the following properties of
f -divergences:
(c) If f (t) = t log t − log t, then Df (P ||Q) = Dkl (P ||Q) + Dkl (Q||P ).
(d) For any convex f satisfying f (1) = 0, Df (P ||Q) ≥ 0. (Hint: use Jensen’s inequality.)
(b) Generalizing the preceding result, let a : X → R+ and b : X → R+ , and let µ be a finite
measure on X with respect to which a is integrable. Show that
∫ a(x) dµ(x) · f( ∫ b(x) dµ(x) / ∫ a(x) dµ(x) ) ≤ ∫ a(x) f( b(x)/a(x) ) dµ(x).
If you are unfamiliar with measure theory, prove the following essentially equivalent result: let u : X → R+ satisfy ∫ u(x) dx < ∞. Show that
∫ a(x)u(x) dx · f( ∫ b(x)u(x) dx / ∫ a(x)u(x) dx ) ≤ ∫ a(x) f( b(x)/a(x) ) u(x) dx
whenever ∫ a(x)u(x) dx < ∞. (It is possible to demonstrate this remains true under appropriate limits even when ∫ a(x)u(x) dx = +∞, but it is a mess.)
(Hint: use the fact that the perspective of a function f, defined by h(x, t) = tf(x/t) for t > 0, is jointly convex in x and t; see Proposition B.3.12.)
Exercise 2.9 (Data processing and f -divergences I): As with the KL-divergence, given a quantizer
g of the set X , where g induces a partition A1 , . . . , Am of X , we define the f -divergence between
P and Q conditioned on g as
Df(P||Q | g) := Σ_{i=1}^m Q(Ai) f( P(Ai)/Q(Ai) ) = Σ_{i=1}^m Q(g^{−1}({i})) f( P(g^{−1}({i})) / Q(g^{−1}({i})) ).
Given quantizers g1 and g2, we say that g1 is a finer quantizer than g2 under the following condition: assume that g1 induces the partition A1, . . . , An and g2 induces the partition B1, . . . , Bm; then for any of the sets Bi, there exist some k and sets A_{i_1}, . . . , A_{i_k} such that Bi = ∪_{j=1}^k A_{i_j}. We let g1 ≺ g2 denote that g1 is a finer quantizer than g2.
(a) Let g1 and g2 be quantizers of the set X , and let g1 ≺ g2 , meaning that g1 is a finer quantization
than g2 . Prove that
Df (P ||Q | g2 ) ≤ Df (P ||Q | g1 ) .
Equivalently, show that whenever A and B are collections of sets partitioning X , but A is a
finer partition of X than B, that
Σ_{B∈B} Q(B) f( P(B)/Q(B) ) ≤ Σ_{A∈A} Q(A) f( P(A)/Q(A) ).
(b) Suppose that X is countable (or finite) so that P and Q have p.m.f.s p and q. Show that
Df(P||Q) = Σ_x q(x) f( p(x)/q(x) ),
where on the left we are using the partition definition (2.2.3); you should show that the partition
into discrete parts of X achieves the supremum. You may assume that X is finite. (Though
feel free to prove the result in the case that X is infinite.)
Exercise 2.10 (General data processing inequalities): Let f be a convex function satisfying
f (1) = 0. Let K be a Markov transition kernel from X to Z, that is, K(·, x) is a probability
distribution on Z for each x ∈ X . (Written differently, we have X → Z, and conditioned on X = x,
Z has distribution K(·, x), so that K(A, x) is the probability that Z ∈ A given X = x.)
(a) Define the marginals KP(A) = ∫ K(A, x) p(x) dx and KQ(A) = ∫ K(A, x) q(x) dx. Show that
Df(KP ||KQ) ≤ Df(P||Q).
Hint: by equation (2.2.3), w.l.o.g. we may assume that Z is finite and Z = {1, . . . , m}; also recall Question 2.8.
(b) Let X and Y be random variables with joint distribution PXY and marginals PX and PY. Define the f-information between X and Y as
If(X; Y) := Df(PXY ||PX × PY).
Use part (a) to show the following general data processing inequality: if we have the Markov
chain X → Y → Z, then
If (X; Z) ≤ If (X; Y ).
Exercise 2.11 (Convexity of f -divergences): Prove Proposition 2.2.11. Hint: Use Question 2.8.
Exercise 2.12 (Variational forms of KL divergence): Let P and Q be arbitrary distributions on a
common space X . Prove the following variational representation, known as the Donsker-Varadhan
theorem, of the KL divergence:
Dkl(P||Q) = sup_{f : E_Q[e^{f(X)}] < ∞} { E_P[f(X)] − log E_Q[exp(f(X))] }.
Exercise 2.13: Let P and Q have densities p and q with respect to the base measure µ over the
set X . (Recall that this is no loss of generality, as we may take µ = P + Q.) Define the support
supp P := {x ∈ X : p(x) > 0}. Show that
Dkl(P||Q) ≥ log( 1/Q(supp P) ).
Exercise 2.14: Let P1 be N(θ1, Σ1) and P2 be N(θ2, Σ2), where Σi ≻ 0 are positive definite matrices. Give Dkl(P1||P2).
Exercise 2.15: Let {Pv }v∈V be an arbitrary collection of distributions on a space X and µ be a
probability measure on V. Show that if V ∼ µ and conditional on V = v, we draw X ∼ Pv , then
(a) I(X; V) = ∫ Dkl(Pv||P̄) dµ(v), where P̄ = ∫ Pv dµ(v) is the (weighted) average of the Pv. You may assume that V is discrete if you like.
(b) For any distribution Q on X, I(X; V) = ∫ Dkl(Pv||Q) dµ(v) − Dkl(P̄||Q). Conclude that I(X; V) ≤ ∫ Dkl(Pv||Q) dµ(v), or, equivalently, that P̄ minimizes ∫ Dkl(Pv||Q) dµ(v) over all probabilities Q.
Exercise 2.16 (The triangle inequality for variation distance): Let P and Q be distributions
on X_1^n = (X1, . . . , Xn) ∈ X^n, and let Pi(· | x_1^{i−1}) be the conditional distribution of Xi given X_1^{i−1} = x_1^{i−1} (and similarly for Qi). Show that
‖P − Q‖_TV ≤ Σ_{i=1}^n E_P[ ‖Pi(· | X_1^{i−1}) − Qi(· | X_1^{i−1})‖_TV ],
(b) Define the negative binary entropy h(p) = p log p + (1 − p) log(1 − p). Show that
Exercise 2.20: Use the paper “A New Metric for Probability Distributions” by Dominik Endres and Johannes Schindelin to prove that if V ∼ Uniform{0, 1} and X | V = v ∼ Pv, then √(I(X; V)) is a metric on distributions. (Said differently, Djs(P||Q)^{1/2} is a metric on distributions, and it generates the same topology as the TV-distance.)
Exercise 2.21: Relate the generalized Jensen-Shannon divergence between m distributions to
redundancy in encoding.
Chapter 3
Our second introductory chapter focuses on readers who may be less familiar with statistical mod-
eling methodology and the how and why of fitting different statistical models. As in the preceding
introductory chapter on information theory, this chapter will be a fairly terse blitz through the main
ideas. Nonetheless, the ideas and distributions here should give us something on which to hang our
hats, so to speak, as the distributions and models provide the basis for examples throughout the
book. Exponential family models form the basis of much of statistics, as they are a natural step
away from the most basic families of distributions—Gaussians—which admit exact computations
but are brittle, to a more flexible set of models that retain enough analytical elegance to permit
careful analyses while giving power in modeling. A key property is that fitting exponential family
models reduces to the minimization of convex functions—convex optimization problems—an oper-
ation we treat as a technology akin to evaluating a function like sin or cos. This perspective (which
is accurate enough) will arise throughout this book, and informs the philosophy we adopt that once
we formulate a problem as convex, it is solved.
In the continuous case, pθ is instead a density on X ⊂ R^k, and pθ takes the identical form above, but
A(θ) = log ∫_X h(x) exp(⟨θ, φ(x)⟩) dx.
We can abstract away from this distinction between discrete and continuous distributions by making
the definition measure-theoretic, which we do here for completeness. (But recall the remarks in
Section 1.3.)
With our notation, we have the following definition.
Definition 3.1. The exponential family associated with the function φ and base measure µ is defined as the set of distributions with densities pθ(x) = exp(⟨θ, φ(x)⟩ − A(θ)) with respect to µ, where
Θ := {θ | A(θ) < ∞}
is open.
In Definition 3.1, we have included the carrier h in the base measure µ, and frequently we will give
ourselves the general notation
In some scenarios, it may be convenient to re-parameterize the problem in terms of some function
η(θ) instead of θ itself; we will not worry about such issues and simply use the formulae that are
most convenient.
We now give a few examples of exponential family models.
Example 3.1.2 (Poisson distribution): The Poisson distribution (for count data) is usually parameterized by some λ > 0, and for x ∈ N has distribution Pλ(X = x) = (1/x!) λ^x e^{−λ}. Thus by taking µ to be counting (discrete) measure on {0, 1, . . .} and setting θ = log λ, we find the density (probability mass function in this case)
p(x) = (1/x!) λ^x e^{−λ} = (1/x!) exp(x log λ − λ) = (1/x!) exp(xθ − e^θ).
Notably, taking h(x) = (x!)^{−1} and log-partition A(θ) = e^θ, we have probability mass function pθ(x) = h(x) exp(θx − A(θ)). 3
Example 3.1.3 (Normal distribution, mean parameterization): For the d-dimensional normal distribution, we take µ to be Lebesgue measure on Rd. If we fix the covariance and vary only the mean µ in the family N(µ, Σ), then X ∼ N(µ, Σ) has density
pµ(x) = exp( −(1/2)(x − µ)^⊤Σ^{−1}(x − µ) − (1/2) log det(2πΣ) ).
In particular, we have carrier h(x) = exp(−(1/2) x^⊤Σ^{−1}x)/((2π)^{d/2} det(Σ)^{1/2}), sufficient statistic φ(x) = x, and log partition A(θ) = (1/2) θ^⊤Σ^{−1}θ. 3
Example 3.1.4 (Normal distribution): Let X ∼ N(µ, Σ). We may re-parameterize this by taking Θ = Σ^{−1} and θ = Σ^{−1}µ, and we have density
p_{θ,Θ}(x) ∝ exp( ⟨θ, x⟩ − (1/2)⟨xx^⊤, Θ⟩ ),
where ⟨·, ·⟩ denotes the Euclidean inner product. See Exercise 3.1. 3
In some cases, it is analytically convenient to include a few more conditions on the exponential
family.
Definition 3.2. Let {Pθ }θ∈Θ be an exponential family as in Definition 3.1. The sufficient statistic
φ is minimal if Θ = dom A ⊂ Rd is full-dimensional and there exists no nonzero vector u such that ⟨u, φ(x)⟩ is constant for µ-almost every x ∈ X.
Definition 3.2 is essentially equivalent to stating that φ(x) = (φ1 (x), . . . , φd (x)) has linearly inde-
pendent components when viewed as vectors [φi (x)]x∈X . While we do not prove this, via a suitable
linear transformation—a variant of Gram-Schmidt orthonormalization—one may modify any non-
minimal exponential family {Pθ } into an equivalent minimal exponential family {Qη }, meaning
that the two collections satisfy the equality {Pθ } = {Qη } (see Brown [39, Chapter 1]).
Here, we enumerate a few of their key analytical properties, focusing on the cumulant generating (or log partition) function A(θ) = log ∫ e^{⟨θ,φ(x)⟩} dµ(x). We begin with a heuristic calculation, where we assume that we exchange differentiation and integration. Assuming that this is the case, we then obtain the important expectation and covariance relationships that
∇A(θ) = (1/∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ ∇_θ e^{⟨θ,φ(x)⟩} dµ(x) = e^{−A(θ)} ∫ ∇_θ e^{⟨θ,φ(x)⟩} dµ(x) = ∫ φ(x) e^{⟨θ,φ(x)⟩−A(θ)} dµ(x) = E_θ[φ(X)]
because e^{⟨θ,φ(x)⟩−A(θ)} = pθ(x). A completely similar (and still heuristic, at least at this point) calculation gives
∇²A(θ) = E_θ[φ(X)φ(X)^⊤] − E_θ[φ(X)]E_θ[φ(X)]^⊤ = Cov_θ(φ(X)).
That these identities hold is no accident and is central to the appeal of exponential family models.
The first and, from our perspective, most important result about exponential family models is
their convexity. While (assuming the differentiation relationships above hold) the differentiation
identity that ∇²A(θ) = Cov_θ(φ(X)) ⪰ 0 makes convexity of A immediate, one can also provide a
direct argument without appealing to differentiation.
Proposition 3.2.1. The cumulant-generating function θ 7→ A(θ) is convex, and it is strictly convex
if and only if Covθ (φ(X)) is positive definite for all θ ∈ dom A.
Proof Let θλ = λθ1 + (1 − λ)θ2, where θ1, θ2 ∈ Θ and λ ∈ (0, 1). Then 1/λ ≥ 1 and 1/(1 − λ) ≥ 1, and Hölder's inequality implies
log ∫ exp(⟨θλ, φ(x)⟩) dµ(x) = log ∫ exp(⟨θ1, φ(x)⟩)^λ exp(⟨θ2, φ(x)⟩)^{1−λ} dµ(x)
≤ log [ (∫ exp(⟨θ1, φ(x)⟩) dµ(x))^λ (∫ exp(⟨θ2, φ(x)⟩) dµ(x))^{1−λ} ]
= λ log ∫ exp(⟨θ1, φ(x)⟩) dµ(x) + (1 − λ) log ∫ exp(⟨θ2, φ(x)⟩) dµ(x),
as desired. The strict convexity will be a consequence of Proposition 3.2.2 to come, as there we
formally show that ∇2 A(θ) = Covθ (φ(X)).
We now show that A(θ) is indeed infinitely differentiable and how it generates the moments of
the sufficient statistics φ(x). To describe the properties, we provide a bit of notation related to
tensor products: for a vector x ∈ Rd , we let
x^{⊗k} := x ⊗ x ⊗ · · · ⊗ x   (k times)
denote the kth order tensor, or multilinear operator, that for v1, . . . , vk ∈ Rd satisfies
x^{⊗k}(v1, . . . , vk) := ⟨x, v1⟩ · · · ⟨x, vk⟩ = Π_{i=1}^k ⟨x, vi⟩.
When k = 2, this is the familiar outer product x^{⊗2} = xx^⊤. (More generally, one may think of x^{⊗k} as a d × d × · · · × d box, where the (i1, . . . , ik) entry is [x^{⊗k}]_{i1,...,ik} = x_{i1} · · · x_{ik}.) With this notation, our first key result regards the differentiability of A, where we can compute (all) derivatives of e^{A(θ)} by interchanging integration and differentiation.
The proof of the proposition is involved and requires complex analysis, so we defer it to Sec. 3.6.1.
As particular consequences of Proposition 3.2.2, we can rigorously demonstrate the expectation
and covariance relationships that
∇A(θ) = (1/∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ ∇ e^{⟨θ,φ(x)⟩} dµ(x) = ∫ φ(x) pθ(x) dµ(x) = E_θ[φ(X)]
and
∇²A(θ) = (1/∫ e^{⟨θ,φ(x)⟩} dµ(x)) ∫ φ(x)^{⊗2} e^{⟨θ,φ(x)⟩} dµ(x) − ( ∫ φ(x) e^{⟨θ,φ(x)⟩} dµ(x) )^{⊗2} / ( ∫ e^{⟨θ,φ(x)⟩} dµ(x) )²
= E_θ[φ(X)φ(X)^⊤] − E_θ[φ(X)]E_θ[φ(X)]^⊤
= Cov_θ(φ(X)).
Minimal exponential families (Definition 3.2) also enjoy a few additional regularity properties.
Recall that A is strictly convex if A(λθ1 + (1 − λ)θ2) < λA(θ1) + (1 − λ)A(θ2) for all θ1 ≠ θ2 in dom A and λ ∈ (0, 1).
Proposition 3.2.3. Let {Pθ } be a regular exponential family. The log partition function A is
strictly convex if and only if {Pθ } is minimal.
Proof If the family is minimal, then Var_θ(u^⊤φ(X)) > 0 for any vector u ≠ 0, while Var_θ(u^⊤φ(X)) = u^⊤∇²A(θ)u. This implies the strict positive definiteness ∇²A(θ) ≻ 0, which is equivalent to strict convexity (see Corollary B.3.2 in Appendix B.3.1). Conversely, if ∇²A(θ) ≻ 0 for all θ ∈ Θ, then Var_θ(u^⊤φ(X)) > 0 for all u ≠ 0 and so u^⊤φ(x) is non-constant in x.
This is always a convex optimization problem (see Appendices B and C for much more on this), as A
is convex and the first term is linear, and so has no non-global optima. Here and throughout, as we
mention in the introductory remarks to this chapter, we treat convex optimization as a technology:
as long as the dimension of a problem is not too large and its objective can be evaluated, it is
(essentially) computationally trivial.
Of course, we never have access to the population P fully; instead, we receive a sample
X1 , . . . , Xn from P . In this case, a natural approach is to replace the expected (negative) log
likelihood above with its empirical version and solve
minimize_θ  −Σ_{i=1}^n log pθ(Xi) = Σ_{i=1}^n [ −⟨θ, φ(Xi)⟩ + A(θ) ],   (3.2.2)
which is still a convex optimization problem (as the objective is convex in θ). The maximum
likelihood estimate is any vector θbn minimizing the negative log likelihood (3.2.2), which by setting
gradients to 0 is evidently any vector satisfying
∇A(θ̂_n) = E_{θ̂_n}[φ(X)] = (1/n) Σ_{i=1}^n φ(Xi).   (3.2.3)
In particular, we need only find a parameter θ̂_n matching moments of the empirical distribution of the observed Xi ∼ P. This θ̂_n is unique whenever Cov_θ(φ(X)) ≻ 0 for all θ, that is, when the covariance of φ is full rank in the exponential family model, because then the objective in the minimization problem (3.2.2) is strictly convex.
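To make the moment-matching characterization (3.2.3) concrete, here is a minimal numerical sketch (ours, not part of the notes; the function names are our own) that fits the one-dimensional Bernoulli exponential family pθ(x) = exp(θx − A(θ)) with A(θ) = log(1 + e^θ) by solving ∇A(θ) = (1/n) Σ φ(Xi) with Newton's method.

```python
import numpy as np

# Bernoulli exponential family: p_theta(x) = exp(theta * x - A(theta)), x in {0, 1},
# with log partition A(theta) = log(1 + e^theta), so that
#   A'(theta)  = e^theta / (1 + e^theta)        (mean of phi(X) = X)
#   A''(theta) = A'(theta) * (1 - A'(theta))    (variance of phi(X))

def grad_A(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def hess_A(theta):
    p = grad_A(theta)
    return p * (1.0 - p)

def fit_mle(x, iters=50):
    """Solve grad A(theta) = mean(x) by Newton's method (a 1-d convex problem)."""
    moment = x.mean()
    theta = 0.0
    for _ in range(iters):
        theta -= (grad_A(theta) - moment) / hess_A(theta)
    return theta

rng = np.random.default_rng(0)
theta_star = 0.8
x = rng.binomial(1, grad_A(theta_star), size=10_000)
theta_hat = fit_mle(x)
print(f"theta* = {theta_star:.3f}, theta_hat = {theta_hat:.3f}")
# Moment matching holds at the solution: grad_A(theta_hat) equals the sample mean.
print(abs(grad_A(theta_hat) - x.mean()) < 1e-8)
```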
Let us proceed heuristically for a moment to develop a rough convergence guarantee for the estimator θ̂_n; the next paragraph assumes some comfort with classical asymptotic statistics (and the central limit theorem) and is not essential for what comes later. Then we can see how minimizers of the problem (3.2.2) converge to their population counterparts. Assume that the data
Xi are i.i.d. from an exponential family model P_{θ*}. Then we expect that the maximum likelihood estimate θ̂_n should converge to θ*, and so
(1/n) Σ_{i=1}^n φ(Xi) = ∇A(θ̂_n) = ∇A(θ*) + (∇²A(θ*) + o(1))(θ̂_n − θ*).
But of course, ∇A(θ*) = E_{θ*}[φ(X)], and so the central limit theorem gives that
(1/n) Σ_{i=1}^n (φ(Xi) − ∇A(θ*)) ·∼ N( 0, n^{−1} Cov_{θ*}(φ(X)) ) = N( 0, n^{−1} ∇²A(θ*) ),
where ·∼ means “is approximately distributed as.” Multiplying by (∇²A(θ*) + o(1))^{−1} ≈ ∇²A(θ*)^{−1},
we thus see (still working in our heuristic)
θ̂_n − θ* = (∇²A(θ*) + o(1))^{−1} (1/n) Σ_{i=1}^n (φ(Xi) − ∇A(θ*)) ·∼ N( 0, n^{−1} · ∇²A(θ*)^{−1} ),   (3.2.4)
where we use that BZ ∼ N(0, BΣB^⊤) if Z ∼ N(0, Σ). (It is possible to make each of these steps fully rigorous.) Thus the cumulant generating function A governs the error we expect in θ̂_n − θ*.
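A quick Monte Carlo sanity check of the heuristic (3.2.4) is easy to run in the Bernoulli family used above; this is our own sketch, not part of the notes, and simply compares the empirical variance of √n(θ̂_n − θ*) with ∇²A(θ*)^{−1}.

```python
import numpy as np

# Check (3.2.4) for p_theta(x) = exp(theta x - A(theta)), A(theta) = log(1 + e^theta).
def grad_A(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def mle(x):
    # The MLE solves grad A(theta) = mean(x), i.e. theta = logit(mean(x)).
    m = x.mean()
    return np.log(m / (1.0 - m))

rng = np.random.default_rng(1)
theta_star, n, trials = 0.5, 2_000, 5_000
p_star = grad_A(theta_star)
hess = p_star * (1.0 - p_star)                      # A''(theta*) = Var_theta*(phi(X))

errs = np.array([np.sqrt(n) * (mle(rng.binomial(1, p_star, n)) - theta_star)
                 for _ in range(trials)])
print("empirical var of sqrt(n)(theta_hat - theta*):", round(errs.var(), 3))
print("predicted A''(theta*)^{-1}:                  ", round(1.0 / hess, 3))
```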
Much of the rest of this book explores properties of these types of minimization problems: at
what rates do we expect θbn to converge to a global minimizer of problem (3.2.1)? Can we show
that these rates are optimal? Is this the “right” strategy for choosing a parameter? Exponential
families form a particular working example to motivate this development.
Similarly, we have
Dkl(P_{θ+∆}||Pθ) = E_{θ+∆}[⟨θ + ∆, φ(X)⟩ − A(θ + ∆) − ⟨θ, φ(X)⟩ + A(θ)]
= A(θ) − A(θ + ∆) + E_{θ+∆}[⟨∆, φ(X)⟩]
= A(θ) − A(θ + ∆) − ∇A(θ + ∆)^⊤(−∆).
These identities give an immediate connection with convexity. Indeed, for a differentiable convex function h, the first-order divergence associated with h is
D_h(θ1, θ2) := h(θ1) − h(θ2) − ⟨∇h(θ2), θ1 − θ2⟩,   (3.3.1)
which is always nonnegative, and is the gap between the linear approximation to the (convex)
function h and its actual value. In much of the statistical and machine learning literature, the
divergence (3.3.1) is called a Bregman divergence, though we will use the more evocative first-
order divergence. These will appear frequently throughout the book and, more generally, appear
frequently in work on optimization and statistics.
JCD Comment: Put in a picture of a Bregman divergence
When the perturbation ∆ is small, the fact that A is infinitely differentiable gives
Dkl(Pθ||P_{θ+∆}) = (1/2) ∆^⊤∇²A(θ)∆ + O(‖∆‖³),
so that the Hessian ∇2 A(θ) tells quite precisely how the KL divergence changes as θ varies (locally).
As we saw already in Example 2.3.2 (and see the next section), when the KL-divergence between
two distributions is small, it is hard to test between them, and in the sequel, we will show converses
to this. The Hessian ∇2 A(θ? ) also governs the error in the estimate θbn − θ? in our heuristic (3.2.4).
When the Hessian ∇²A(θ) is large (very positive definite), the KL divergence Dkl(Pθ||P_{θ+∆}) is large, and the asymptotic covariance (3.2.4) is small. For this—and other reasons we address later—for exponential family models, we call
∇²A(θ) = Cov_θ(φ(X))   (3.3.2)
the Fisher information.
The log partition function A(· | x) provides the same insights for the conditional models (3.4.1)
as it does for the unconditional exponential family models in the preceding sections. Indeed, as
in Propositions 3.2.1 and 3.2.2, the log partition A(· | x) is always C ∞ on its domain and convex.
Moreover, it gives the expected moments of the sufficient statistic φ conditional on x, as
∇A(θ | x) = E_θ[φ(X, Y) | X = x],
from which we can (typically) extract the mean or other statistics of Y conditional on x.
Three standard examples will be our most frequent motivators throughout this book: linear
regression, binary logistic regression, and multiclass logistic regression. We give these three, as
well as describing two more important examples involving modeling count data through Poisson
regression and making predictions for targets y known to live in a bounded set.
so that we have the exponential family representation (3.4.1) with φ(x, y) = (1/σ²) xy, h(y) = exp( −(1/(2σ²)) y² − (1/2) log(2πσ²) ), and A(θ | x) = (1/(2σ²)) θ^⊤xx^⊤θ. As ∇A(θ | x) = E_θ[φ(X, Y) | X = x] = (1/σ²) x E_θ[Y | X = x], we easily recover E_θ[Y | X = x] = θ^⊤x. 3
Frequently, we wish to predict binary or multiclass random variables Y . For example, consider
a medical application in which we wish to assess the probability that, based on a set of covariates
x ∈ Rd (say, blood pressure, height, weight, family history), an individual will have a heart attack
in the next 5 years, so that Y = 1 indicates heart attack and Y = −1 indicates not. The next
example shows how we might model this.
Example 3.4.2 (Binary logistic regression): If Y ∈ {−1, 1}, we model
pθ(y | x) = exp(y x^⊤θ) / (1 + exp(y x^⊤θ)),
where the idea in the probability above is that if x^⊤θ has the same sign as y, then the larger x^⊤θ y becomes, the higher the probability assigned to the label y; when x^⊤θ y < 0, the probability is small. Of course, we always have pθ(y | x) + pθ(−y | x) = 1, and using the identity
y x^⊤θ − log(1 + exp(y x^⊤θ)) = ((y + 1)/2) x^⊤θ − log(1 + exp(x^⊤θ))
we obtain the generalized linear model representation φ(x, y) = ((y + 1)/2) x and A(θ | x) = log(1 + exp(x^⊤θ)).
As an alternative, we could represent Y ∈ {0, 1} by
pθ(y | x) = exp(y x^⊤θ) / (1 + exp(x^⊤θ)) = exp( y x^⊤θ − log(1 + e^{x^⊤θ}) ),
which has the simpler sufficient statistic φ(x, y) = xy. 3
Instead of a binary prediction problem, in many cases we have a multiclass prediction problem,
where we seek to predict a label Y for an object x belonging to one of k different classes. For
example, in image recognition, we are given an image x and wish to identify the subject Y of the
image, where Y ranges over k classes, such as birds, dogs, cars, trucks, and so on. This too we can
model using exponential families.
Example 3.4.3 (Multiclass logistic regression): In the case that we have a k-class prediction
problem in which we wish to predict Y ∈ {1, . . . , k} from X ∈ Rd , we assign parameters
θy ∈ Rd to each of the classes y = 1, . . . , k. We then model
pθ(y | x) = exp(θ_y^⊤x) / Σ_{j=1}^k exp(θ_j^⊤x) = exp( θ_y^⊤x − log Σ_{j=1}^k e^{θ_j^⊤x} ).
Here, the idea is that if θy> x > θj> x for all j 6= y, then the model assigns higher probability to
class y than any other class; the larger the gap between θy> x and θj> x, the larger the difference
in assigned probabilities. 3
Other approaches with these ideas allow us to model other situations. Poisson regression models
are frequent choices for modeling count data. For example, consider an insurance company that
wishes to issue premiums for shipping cargo in different seasons and on different routes, and so
wishes to predict the number of times a given cargo ship will be damaged by waves over a period
of service; we might represent this with a feature vector x encoding information about the ship to
be insured, typical weather on the route it will take, and the length of time it will be in service.
To model such counts Y ∈ {0, 1, 2, . . .}, we turn to Poisson regression.
Example 3.4.4 (Poisson regression): When Y ∈ N is a count, the Poisson distribution with rate λ > 0 gives P(Y = y) = e^{−λ}λ^y/y!. Poisson regression models λ via e^{θ^⊤x}, giving model
pθ(y | x) = (1/y!) exp( y x^⊤θ − e^{θ^⊤x} ),
so that we have carrier h(y) = 1/y! and the simple sufficient statistic φ(x, y) = xy. The log partition function is A(θ | x) = e^{θ^⊤x}. 3
Lastly, we consider a less standard example, but which highlights the flexibility of these models.
Here, we assume a linear regression problem but in which we wish to predict values Y in a bounded
range.
Example 3.4.5 (Bounded range regression): Suppose that we know Y ∈ [−b, b], but we wish to model it via an exponential family model with density pθ(y | x) = exp( y x^⊤θ − A(θ | x) ) with respect to Lebesgue measure on [−b, b], where A(θ | x) = log ∫_{−b}^{b} e^{y x^⊤θ} dy.
While its functional form makes this highly non-obvious, our general results guarantee that
A(θ | x) is indeed C ∞ and convex in θ. We have ∇A(θ | x) = xEθ [Y | X = x] because
φ(x, y) = xy, and we can therefore immediately recover Eθ [Y | X = x]. Indeed, set s = θ> x,
and without loss of generality assume s 6= 0. Then
E[Y | x^⊤θ = s] = ∂/∂s log( (e^{bs} − e^{−bs})/s ) = b(e^{bs} + e^{−bs})/(e^{bs} − e^{−bs}) − 1/s,
which increases from −b to b as s = x> θ increases from −∞ to +∞. 3
Once again, to find θbn amounts to matching moments, as ∇A(θ | Xi ) = E[φ(X, Y ) | X = Xi ], and
we still enjoy the convexity properties of the standard exponential family models.
In general, we of course do not expect any exponential family or generalized linear model (GLM)
to have perfect fidelity to the world: all models are inaccurate (but many are useful!). Nonetheless, we can still fit any of the GLM models in Examples 3.4.1–3.4.5 to data of the appropriate type. In particular, for the logarithmic loss ℓ(θ; x, y) = − log pθ(y | x), we can define the empirical loss
Ln(θ) := (1/n) Σ_{i=1}^n ℓ(θ; Xi, Yi).
Then, as n → ∞, we expect that Ln (θ) → E[`(θ; X, Y )], so that the minimizing θ should give the
best predictions possible according to the loss `. We shall therefore often be interested in such
convergence guarantees and the deviations of sample quantities (like Ln ) from their population
counterparts.
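As a concrete illustration of minimizing the empirical log loss Ln for one of these GLMs, the following sketch (our own construction, not from the notes; the data-generating choices are illustrative) runs plain gradient descent on the {0, 1}-coded binary logistic model of Example 3.4.2, whose gradient is exactly a moment-matching residual in φ(x, y) = xy.

```python
import numpy as np

# Empirical log loss for the {0,1}-coded logistic model:
#   ell(theta; x, y) = -y x^T theta + A(theta | x),  A(theta | x) = log(1 + exp(x^T theta)),
# whose gradient is x * (sigmoid(x^T theta) - y).

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def empirical_loss_grad(theta, X, y):
    s = X @ theta
    loss = np.mean(np.logaddexp(0.0, s) - y * s)     # numerically stable log(1 + e^s)
    grad = X.T @ (sigmoid(s) - y) / len(y)
    return loss, grad

rng = np.random.default_rng(2)
n, d = 2_000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = rng.binomial(1, sigmoid(X @ theta_true))

theta, step = np.zeros(d), 0.5
for _ in range(500):                                 # gradient descent on a convex objective
    loss, grad = empirical_loss_grad(theta, X, y)
    theta -= step * grad
print("final loss:", round(loss, 4), " parameter error:", round(np.linalg.norm(theta - theta_true), 3))
```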
H0 : θ = θ0 versus H1,n : θ = θn
as n grows, where we observe a sample X1n drawn i.i.d. either according to Pθ0 (i.e., H0 ) or Pθn
(i.e., H1,n ). By choosing θn in a way that makes the separation v > (θn − θ0 ) large but testing H0
against H1,n challenging, we can then (roughly) identify the separation δ at which testing becomes
impossible.
Proposition 3.5.1. Let θ0 ∈ Rd. Then there exists a sequence of parameters θn with ‖θn − θ0‖ = O(1/√n), separation
v^⊤(θn − θ0) = (1/√n) √( v^⊤∇²A(θ0)^{−1}v ),
and for which
inf_Ψ { P_{θ0}(Ψ(X_1^n) ≠ 0) + P_{θn}(Ψ(X_1^n) ≠ 1) } ≥ 1/2 + O(n^{−1/2}).
Proof Let ∆ ∈ Rd be a potential perturbation to θ1 = θ0 + ∆, which gives separation δ = v^⊤θ1 − v^⊤θ0 = v^⊤∆. Let P0 = P_{θ0} and P1 = P_{θ1}. Then the smallest summed probability of error in testing between P0 and P1 based on n observations X_1^n is
1 − ‖P_0^n − P_1^n‖_TV
by Proposition 2.3.1. Following the approach of Example 2.3.2, we apply Pinsker’s inequal-
ity (2.2.10) and use that the KL-divergence tensorizes to find
2 ‖P_0^n − P_1^n‖²_TV ≤ n Dkl(P0||P1) = n Dkl(P_{θ0}||P_{θ0+∆}) = n D_A(θ0 + ∆, θ0),
where the final equality follows from the equivalence between KL and first-order divergences for
exponential families (Proposition 3.3.1).
To guarantee that the summed probability of error is at least 1/2, that is, ‖P_0^n − P_1^n‖_TV ≤ 1/2, it suffices to choose ∆ satisfying n D_A(θ0 + ∆, θ0) ≤ 1/2. So to maximize the separation v^⊤∆ while
guaranteeing a constant probability of error, we (approximately) solve
maximize_∆  v^⊤∆   subject to  D_A(θ0 + ∆, θ0) ≤ 1/(2n).
Now, consider that D_A(θ0 + ∆, θ0) = (1/2) ∆^⊤∇²A(θ0)∆ + O(‖∆‖³). Ignoring the higher order term, we consider maximizing v^⊤∆ subject to ∆^⊤∇²A(θ0)∆ ≤ 1/n. A Lagrangian calculation shows that this has solution
∆ = (1/√n) · ∇²A(θ0)^{−1}v / √( v^⊤∇²A(θ0)^{−1}v ).
With this choice, we have separation δ = v^⊤∆ = √( v^⊤∇²A(θ0)^{−1}v / n ), and D_A(θ0 + ∆, θ0) = 1/(2n) + O(1/n^{3/2}). The summed probability of error is at least
1 − ‖P_0^n − P_1^n‖_TV ≥ 1 − √(n/(4n)) + O(n^{−1/2}) = 1 − √(1/4) + O(n^{−1/2}) = 1/2 + O(n^{−1/2})
as desired.
Let us briefly sketch out why Proposition 3.5.1 is the “right” answer using the heuristics in Sec-
tion 3.2.1. For an unknown parameter θ in the exponential family model Pθ , we observe X1 , . . . , Xn ,
and wish to test whether v^⊤θ ≥ t for a given threshold t. Call our null H0 : v^⊤θ ≤ t, and assume we wish to test at an asymptotic level α > 0, meaning that the probability the test falsely rejects H0 is (as n → ∞) at most α. Assuming the heuristic (3.2.4), we have the approximate distributional equality
v^⊤θ̂_n ·∼ N( v^⊤θ, (1/n) v^⊤∇²A(θ̂_n)^{−1}v ).
Note that we have θbn on the right side of the distribution; it is possible to make this rigorous, but
here we target only intuition building. A natural asymptotically level α test is then
Tn := Reject if v^⊤θ̂_n ≥ t + z_{1−α} √( v^⊤∇²A(θ̂_n)^{−1}v / n ), and Accept otherwise,
where z1−α is the 1 − α quantile of a standard normal, P(Z ≥ z1−α ) = α for Z ∼ N(0, 1). Let θ0
be such that v > θ0 = t, so H0 holds. Then
P_{θ0}(Tn rejects) = P_{θ0}( √n · v^⊤(θ̂_n − θ0) ≥ z_{1−α} √( v^⊤∇²A(θ̂_n)^{−1}v ) ) → α.
At least heuristically, then, this separation δ = √( v^⊤∇²A(θ0)^{−1}v )/√n is the fundamental separation in parameter values at which testing becomes possible (or below which it is impossible).
As a brief and suggestive aside, the precise growth of the KL-divergence Dkl(P_{θ0+∆}||P_{θ0}) = (1/2) ∆^⊤∇²A(θ0)∆ + O(‖∆‖³) near θ0 plays the fundamental role in both the lower bound and upper bound on testing.
bound on testing. When the Hessian ∇2 A(θ0 ) is “large,” meaning it is very positive definite,
distributions with small parameter distances are still well-separated in KL-divergence, making
testing easy, while when ∇2 A(θ0 ) is small (nearly indefinite), the KL-divergence can be small even
for large parameter separations ∆ and testing is hard. As a consequence, at least for exponential
family models, the Fisher information (3.3.2), which we defined as ∇2 A(θ) = Covθ (φ(X)), plays a
central role in testing and, as we see later, estimation.
Lemma 3.6.1. Consider any collection {θ1, . . . , θm} ⊂ Θ, and let Θ0 = Conv{θi}_{i=1}^m and C ⊂ int Θ0. Then for any k ∈ N, there exists a constant K = K(C, k, {θi}) such that for all θ0 ∈ C and all x,
‖x‖^k exp(⟨θ0, x⟩) ≤ K max_{j≤m} exp(⟨θj, x⟩).
Proof Let B = {u ∈ Rd | ‖u‖ ≤ 1} be the unit ball in Rd. For any ε > 0, there exists a K = K(ε) such that ‖x‖^k ≤ Ke^{ε‖x‖} for all x ∈ Rd. As C ⊂ int Conv(Θ0), there exists an ε > 0 such that for all θ0 ∈ C, θ0 + 2εB ⊂ Θ0, and by construction, for any u ∈ B we can write θ0 + 2εu = Σ_{j=1}^m λjθj for some λ ∈ R_+^m with 1^⊤λ = 1. We therefore have
But using the convexity of t ↦ exp(t) and that θ0 + 2εu ∈ Θ0, the last quantity has upper bound
Lemma 3.6.2. Under the conditions of Lemma 3.6.1, there exists a K such that for any θ, θ0 ∈ C
Proof We write
( exp(⟨θ, x⟩) − exp(⟨θ0, x⟩) ) / ‖θ − θ0‖ = exp(⟨θ0, x⟩) · ( exp(⟨θ − θ0, x⟩) − 1 ) / ‖θ − θ0‖
¹For complex functions, Osgood's lemma shows that if A is continuous and holomorphic in each variable individually, it is holomorphic. For a treatment of such ideas in an engineering context, see, e.g., [92, Ch. 1].
From this, we can assume without loss of generality that θ0 = 0 (by shifting). Now note that
by convexity e^{−a} ≥ 1 − a for all a ∈ R, so 1 − e^a ≤ |a| when a ≤ 0. Conversely, if a > 0, then ae^a ≥ e^a − 1 (note that (d/da)(ae^a) = ae^a + e^a ≥ e^a), so dividing by ‖x‖, we see that
as desired.
With the lemmas in hand, we can demonstrate a dominating function for the derivatives. Indeed,
fix θ0 ∈ int Θ and for θ ∈ Θ, define
g(θ, x) = ( exp(⟨θ, x⟩) − exp(⟨θ0, x⟩) − exp(⟨θ0, x⟩)⟨x, θ − θ0⟩ ) / ‖θ − θ0‖ = ( e^{⟨θ,x⟩} − e^{⟨θ0,x⟩} − ⟨∇e^{⟨θ0,x⟩}, θ − θ0⟩ ) / ‖θ − θ0‖.
Then lim_{θ→θ0} g(θ, x) = 0 by the differentiability of t ↦ e^t. Lemmas 3.6.1 and 3.6.2 show that if we take any collection {θj}_{j=1}^m ⊂ Θ for which θ0 ∈ int Conv{θj}, then for C ⊂ int Conv{θj}, there exists a constant K such that
|g(θ, x)| ≤ |exp(⟨θ, x⟩) − exp(⟨θ0, x⟩)| / ‖θ − θ0‖ + ‖x‖ exp(⟨θ0, x⟩) ≤ K max_j exp(⟨θj, x⟩)
for all θ ∈ C. As ∫ max_j e^{⟨θj,x⟩} dµ(x) ≤ Σ_{j=1}^m ∫ e^{⟨θj,x⟩} dµ(x) < ∞, the dominated convergence theorem thus implies that
lim_{θ→θ0} ∫ g(θ, x) dµ(x) = 0,
Analyticity Over the subset Θ_C := {θ + iz | θ ∈ Θ, z ∈ Rd} (where i = √−1 is the imaginary unit), we can extend the preceding results to demonstrate that A is analytic on Θ_C. Indeed, we first simply note that for a, b ∈ R, exp(a + ib) = exp(a) exp(ib) and |exp(a + ib)| = exp(a), i.e., |e^z| = e^{Re(z)} for z ∈ C, and so Lemmas 3.6.1 and 3.6.2 follow mutatis mutandis as in the real case. These are enough for the application of the dominated convergence theorem above, and we use that exp(·) is analytic to conclude that θ ↦ M(θ) is analytic on Θ_C.
3.7 Bibliography
3.8 Exercises
Exercise 3.1: In Example 3.1.4, give the sufficient statistic φ and an explicit formula for the log
partition function A(θ, Θ) so that we can write pθ,Θ (x) = exp(hθ, φ1 (x)i + hΘ, φ2 (x)i − A(θ, Θ)).
Exercise 3.2: Consider the binary logistic regression model in Example 3.4.2, and let `(θ; x, y) =
− log pθ (y | x) be the associated log loss.
(ii) Let (xi , yi )ni=1 ⊂ Rd × {±1} be a sample. Give a sufficient condition for the minimizer of the
empirical log loss
Ln(θ) := (1/n) Σ_{i=1}^n ℓ(θ; xi, yi)
to be unique that depends only on the vectors {xi }. Hint. A convex function h is strictly
convex if and only if its Hessian ∇2 h is positive definite.
Part I
Chapter 4
Concentration Inequalities
In many scenarios, it is useful to understand how a random variable X behaves by giving bounds
on the probability that it deviates far from its mean or median. This can allow us to give prove
that estimation and learning procedures will have certain performance, that different decoding and
encoding schemes work with high probability, among other results. In this chapter, we give several
tools for proving bounds on the probability that random variables are far from their typical values.
We conclude the section with a discussion of basic uniform laws of large numbers and applications
to empirical risk minimization and statistical learning, though we focus on the relatively simple
cases we can treat with our tools.
The basic question throughout is the following: given a random variable X and a threshold t, how well can we bound the tail probability P(X ≥ t)?
We begin with the three most classical inequalities for this purpose: the Markov, Chebyshev,
and Chernoff bounds, which are all instances of the same technique.
The basic inequality off of which all else builds is Markov’s inequality.
Proposition 4.1.1 (Markov’s inequality). Let X be a nonnegative random variable, meaning that
X ≥ 0 with probability 1. Then
P(X ≥ t) ≤ E[X]/t.
Proof For any random variable, P(X ≥ t) = E[1 {X ≥ t}] ≤ E[(X/t)1 {X ≥ t}] ≤ E[X]/t, as
X/t ≥ 1 whenever X ≥ t.
When we know more about a random variable than that its expectation is finite, we can give
somewhat more powerful bounds on the probability that the random variable deviates from its
typical values. The first step in this direction, Chebyshev’s inequality, requires two moments, and
when we have exponential moments, we can give even stronger results. As we shall see, each of
these results is but an application of Proposition 4.1.1.
Proposition 4.1.2 (Chebyshev’s inequality). Let X be a random variable with Var(X) < ∞. Then
P(X − E[X] ≥ t) ≤ Var(X)/t²  and  P(X − E[X] ≤ −t) ≤ Var(X)/t²
for all t ≥ 0.
Proof We prove only the upper tail result, as the lower tail is identical. We first note that
X − E[X] ≥ t implies that (X − E[X])2 ≥ t2 . But of course, the random variable Z = (X − E[X])2
is nonnegative, so Markov’s inequality gives P(X − E[X] ≥ t) ≤ P(Z ≥ t2 ) ≤ E[Z]/t2 , and
E[Z] = E[(X − E[X])2 ] = Var(X).
P(X ≥ t) ≤ E[e^{λX}]/e^{λt} = ϕ_X(λ) e^{−λt}
for all λ ≥ 0.
Proof This is another application of Markov’s inequality: for λ > 0, we have eλX ≥ eλt if and
only if X ≥ t, so that P(X ≥ t) = P(eλX ≥ eλt ) ≤ E[eλX ]/eλt .
In particular, taking the infimum over all λ ≥ 0 in Proposition 4.1.3 gives the more standard
Chernoff (large deviation) bound
P(X ≥ t) ≤ exp( inf_{λ≥0} { log ϕ_X(λ) − λt } ).
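For intuition, here is a short numerical comparison (ours, not from the notes) of the Markov, Chebyshev, and Chernoff bounds against the exact tail of an Exponential(1) random variable, whose moment generating function E[e^{λX}] = 1/(1 − λ) for λ < 1 we use in the Chernoff step.

```python
import numpy as np

# Compare Markov, Chebyshev, and Chernoff bounds with the exact tail P(X >= t)
# for X ~ Exponential(1): E[X] = Var(X) = 1 and E[e^{lam X}] = 1/(1 - lam) for lam < 1.
for t in [2.0, 5.0, 10.0]:
    exact = np.exp(-t)
    markov = 1.0 / t                                     # E[X] / t
    chebyshev = 1.0 / (t - 1.0) ** 2                     # Var(X) / (t - E[X])^2
    lams = np.linspace(0.0, 0.999, 10_000)
    chernoff = np.min(np.exp(-lams * t) / (1.0 - lams))  # inf_lam e^{-lam t} E[e^{lam X}]
    print(f"t={t:4.1f}  exact={exact:.2e}  Markov={markov:.2e}  "
          f"Chebyshev={chebyshev:.2e}  Chernoff={chernoff:.2e}")
```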
ϕ_X(λ) = E[exp(λX)] = exp( λ²σ²/2 ).   (4.1.1)
As a consequence of the equality (4.1.1) and the Chernoff bound technique (Proposition 4.1.3),
we see that for X Gaussian with variance σ 2 , we have
P(X ≥ E[X] + t) ≤ exp( −t²/(2σ²) )  and  P(X ≤ E[X] − t) ≤ exp( −t²/(2σ²) )
for all t ≥ 0. Indeed, we have log ϕ_{X−E[X]}(λ) = λ²σ²/2, and inf_λ { λ²σ²/2 − λt } = −t²/(2σ²), which is attained by λ = t/σ². 3
Example 4.1.5 (Random signs (Rademacher variables)): The random variable X taking values in {−1, 1} with equal probability is 1-sub-Gaussian. Indeed, we have
E[exp(λX)] = (1/2) e^λ + (1/2) e^{−λ} = (1/2) Σ_{k=0}^∞ λ^k/k! + (1/2) Σ_{k=0}^∞ (−λ)^k/k! = Σ_{k=0}^∞ λ^{2k}/(2k)! ≤ Σ_{k=0}^∞ (λ²)^k/(2^k k!) = exp( λ²/2 ),
as claimed. 3
Bounded random variables are also sub-Gaussian; indeed, we have the following example.
Example 4.1.6 (Bounded random variables): Suppose that X is bounded, say X ∈ [a, b].
Then Hoeffding’s lemma states that
E[e^{λ(X−E[X])}] ≤ exp( λ²(b − a)²/8 ),
so that X is (b − a)2 /4-sub-Gaussian.
We prove a somewhat weaker statement with a simpler argument, while Exercise 4.1 gives one
approach to proving the above statement. First, let ε ∈ {−1, 1} be a Rademacher variable,
so that P(ε = 1) = P(ε = −1) = 12 . We apply a so-called symmetrization technique—a
common technique in probability theory, statistics, concentration inequalities, and Banach
space research—to give a simpler bound. Indeed, let X 0 be an independent copy of X, so that
E[X'] = E[X]. We have
E[exp(λ(X − E[X]))] = E[exp(λ(X − E[X']))] ≤ E[exp(λ(X − X'))] = E[exp(λε(X − X'))],
where the inequality follows from Jensen's inequality and the last equality is a consequence of
the fact that X − X 0 is symmetric about 0. Using the result of Example 4.1.5,
E[exp(λε(X − X'))] ≤ E[exp( λ²(X − X')²/2 )] ≤ exp( λ²(b − a)²/2 ),
where the final inequality is immediate from the fact that |X − X 0 | ≤ b − a. 3
While Example 4.1.6 shows how a symmetrization technique can give sub-Gaussian behavior, more sophisticated techniques involve explicitly bounding the logarithm of the moment generating function of X, often by calculations involving exponential tilts of its density. In particular, letting X be mean zero for simplicity, if we let ψ(λ) := log E[e^{λX}], then
ψ'(λ) = E[Xe^{λX}]/E[e^{λX}]  and  ψ''(λ) = E[X²e^{λX}]/E[e^{λX}] − E[Xe^{λX}]²/E[e^{λX}]²,
where we can interchange the order of taking expectations and derivatives whenever ψ(λ) is finite.
Notably, if X has density pX (with respect to any base measure) then the random variable Yλ with density
pλ(y) = pX(y) e^{λy}/E[e^{λX}]
(with respect to the same base measure) satisfies E[Yλ] = ψ'(λ) and Var(Yλ) = ψ''(λ).
One can exploit this in many ways, which the exercises and coming chapters do. As a particular
example, we can give sharper sub-Gaussian constants for Bernoulli random variables.
Example 4.1.7 (Bernoulli random variables): Let X be Bernoulli(p), so that X = 1 with
probability p and X = 0 otherwise. Then a strengthening of Hoeffding’s lemma (also, essen-
tially, due to Hoeffding) is that
log E[e^{λ(X−p)}] ≤ (σ²(p)/2) λ²   for   σ²(p) := (1 − 2p) / (2 log((1 − p)/p)).
Here we take limits as p → {0, 1/2, 1} and have σ²(0) = 0, σ²(1) = 0, and σ²(1/2) = 1/4.
Because p ↦ σ²(p) is concave and symmetric about p = 1/2, this inequality is always sharper
than that of Example 4.1.6. Exercise 4.9 gives one proof of this bound exploiting exponential
tilting. 3
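A two-line check (ours, not part of the notes) of the Bernoulli constant σ²(p) from Example 4.1.7 confirms numerically that it never exceeds the Hoeffding constant 1/4 of Example 4.1.6.

```python
import numpy as np

def sigma2(p):
    """Sharpened Bernoulli sub-Gaussian parameter of Example 4.1.7 (for p != 1/2).

    The limit as p -> 1/2 is 1/4, matching the Hoeffding constant of Example 4.1.6.
    """
    return (1.0 - 2.0 * p) / (2.0 * np.log((1.0 - p) / p))

ps = np.array([0.01, 0.1, 0.25, 0.4, 0.6, 0.9, 0.99])
print(np.round(sigma2(ps), 4))
print(bool(np.all(sigma2(ps) <= 0.25)))   # always at most the Hoeffding constant 1/4
```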
Chernoff bounds for sub-Gaussian random variables are immediate; indeed, they have the same
concentration properties as Gaussian random variables, a consequence of the nice analytical prop-
erties of their moment generating functions (that their logarithms are at most quadratic). Thus,
using the technique of Example 4.1.4, we obtain the following proposition.
Proposition 4.1.8. Let X be σ²-sub-Gaussian. Then for all t ≥ 0 we have
P(X − E[X] ≥ t) ∨ P(X − E[X] ≤ −t) ≤ exp( −t²/(2σ²) ).
Chernoff bounds extend naturally to sums of independent random variables, because moment
generating functions of sums of independent random variables become products of moment gener-
ating functions.
Proposition 4.1.9. Let X1, . . . , Xn be independent random variables, where Xi is σi²-sub-Gaussian. Then Σ_{i=1}^n Xi is (Σ_{i=1}^n σi²)-sub-Gaussian.
Proof We assume w.l.o.g. that the Xi are mean zero. We have by independence and sub-Gaussianity that
E[ exp( λ Σ_{i=1}^n Xi ) ] = E[ exp( λ Σ_{i=1}^{n−1} Xi ) ] E[exp(λXn)] ≤ exp( λ²σn²/2 ) E[ exp( λ Σ_{i=1}^{n−1} Xi ) ].
Repeating this argument for i = n − 1, n − 2, . . . , 1 gives the result.
Two immediate corollaries to Propositions 4.1.8 and 4.1.9 show that sums of sub-Gaussian random
variables concentrate around their expectations. We begin with a general concentration inequality.
Corollary 4.1.10. Let Xi be independent σi2 -sub-Gaussian random variables. Then for all t ≥ 0
max{ P( Σ_{i=1}^n (Xi − E[Xi]) ≥ t ), P( Σ_{i=1}^n (Xi − E[Xi]) ≤ −t ) } ≤ exp( −t² / (2 Σ_{i=1}^n σi²) ).
Additionally, the classical Hoeffding bound follows when we couple Example 4.1.6 with Corollary 4.1.10: if Xi ∈ [ai, bi], then
P( Σ_{i=1}^n (Xi − E[Xi]) ≥ t ) ≤ exp( −2t² / Σ_{i=1}^n (bi − ai)² ).
To give another interpretation of these inequalities, let us assume that the Xi are independent and σ²-sub-Gaussian. Then we have that
P( (1/n) Σ_{i=1}^n (Xi − E[Xi]) ≥ t ) ≤ exp( −nt²/(2σ²) ),
or, for δ ∈ (0, 1), setting exp(−nt²/(2σ²)) = δ, that is, t = √(2σ² log(1/δ))/√n, we have that
(1/n) Σ_{i=1}^n (Xi − E[Xi]) ≤ √(2σ² log(1/δ))/√n   with probability at least 1 − δ.
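The following small simulation (ours, not from the notes) illustrates this high-probability bound for Rademacher variables, which are 1-sub-Gaussian, by comparing the empirical 1 − δ quantile of the sample mean with √(2σ² log(1/δ))/√n.

```python
import numpy as np

# Empirical check: for sigma^2-sub-Gaussian X_i,
#   (1/n) sum_i (X_i - E X_i) <= sqrt(2 sigma^2 log(1/delta)) / sqrt(n)  w.p. >= 1 - delta.
rng = np.random.default_rng(3)
n, trials, delta, sigma2 = 200, 20_000, 0.05, 1.0   # Rademacher signs are 1-sub-Gaussian

means = rng.choice([-1.0, 1.0], size=(trials, n)).mean(axis=1)
bound = np.sqrt(2.0 * sigma2 * np.log(1.0 / delta) / n)

print("bound at delta=0.05:      ", round(bound, 4))
print("empirical 95% quantile:   ", round(np.quantile(means, 1.0 - delta), 4))
print("fraction exceeding bound: ", (means > bound).mean())   # well below delta
```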
There are a variety of other conditions equivalent to sub-Gaussianity, which we capture in the
following theorem.
Theorem 4.1.11. Let X be a mean-zero random variable and σ 2 ≥ 0 be a constant. The following
statements are all equivalent, meaning that there are numerical constant factors Kj such that if one
statement (i) holds with parameter Ki , then statement (j) holds with parameter Kj ≤ CKi , where
C is a numerical constant.
(1) Sub-Gaussian tails: P(|X| ≥ t) ≤ 2 exp( −t²/(K1σ²) ) for all t ≥ 0.
(2) Sub-Gaussian moments: E[|X|^k]^{1/k} ≤ K2 σ√k for all k.
Particularly, (1) implies (2) with K1 = 1 and K2 ≤ e^{1/e}; (2) implies (3) with K2 = 1 and K3 = e√(2/(e − 1)) < 3; (3) implies (4) with K3 = 1 and K4 ≤ 4/3; and (4) implies (1) with K4 = 1/2 and K1 ≤ 2.
This result is standard in the literature on concentration and random variables, but see Ap-
pendix 4.5.1 for a proof of this theorem.
For completeness, we can give a tighter result than part (3) of the preceding theorem, giving a
concrete upper bound on squares of sub-Gaussian random variables. The technique used in the ex-
ample, to introduce an independent random variable for auxiliary randomization, is a common and
useful technique in probabilistic arguments (similar to our use of symmetrization in Example 4.1.6).
E[exp(λX²)] = E[exp(√(2λ) XZ)] ≤(i) E[exp(λσ²Z²)] =(ii) 1/[1 − 2σ²λ]_+^{1/2},
where inequality (i) follows because X is sub-Gaussian, and inequality (ii) because Z ∼ N(0, 1).
3
where inequality (i) holds for λ ≤ 1/4, because −log(1 − 2λ) ≤ 2λ + 4λ² for λ ≤ 1/4. 3
As a second example, we can show that bounded random variables are sub-exponential. It is
clear that this is the case as they are also sub-Gaussian; however, in many cases, it is possible to
show that their parameters yield much tighter control over deviations than is possible using only
sub-Gaussian techniques.
Example 4.1.14 (Bounded random variables are sub-exponential): Suppose that X is a
mean zero random variable taking values in [−b, b] with variance σ 2 = E[X 2 ] (note that we are
guaranteed that σ 2 ≤ b2 in this case). We claim that
E[exp(λX)] ≤ exp( 3λ²σ²/5 )   for |λ| ≤ 1/(2b).   (4.1.4)
To see this, note first that for k ≥ 2 we have E[|X|k ] ≤ E[X 2 bk−2 ] = σ 2 bk−2 . Then by an
expansion of the exponential, we find
E[exp(λX)] = 1 + E[λX] + λ²E[X²]/2 + Σ_{k=3}^∞ λ^k E[X^k]/k! ≤ 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k σ² b^{k−2}/k!
= 1 + λ²σ²/2 + λ²σ² Σ_{k=1}^∞ (λb)^k/(k + 2)! ≤(i) 1 + λ²σ²/2 + λ²σ²/10,
inequality (i) holding for λ ≤ 1/(2b). Using that 1 + x ≤ e^x gives the result.
It is possible to give a slightly tighter result for λ ≥ 0. In this case, we have the bound
E[exp(λX)] ≤ 1 + λ²σ²/2 + λ²σ² Σ_{k=3}^∞ λ^{k−2}b^{k−2}/k! = 1 + (σ²/b²)( e^{λb} − 1 − λb ).
Then using that 1 + x ≤ e^x, we obtain Bennett's moment generating inequality, which is that
E[e^{λX}] ≤ exp( (σ²/b²)( e^{λb} − 1 − λb ) )   for λ ≥ 0.   (4.1.5)
Inequality (4.1.5) always holds, and for λb near 0, we have e^{λb} − 1 − λb ≈ λ²b²/2. 3
In particular, if the variance σ² ≪ b², the square of the absolute bound on X, inequality (4.1.4) gives much tighter control on the moment generating function of X than typical sub-Gaussian bounds based only on the fact that X ∈ [−b, b] allow.
More broadly, we can show a result similar to Theorem 4.1.11.
Theorem 4.1.15. Let X be a random variable and σ ≥ 0. Then—in the sense of Theorem 4.1.11—
the following statements are all equivalent for suitable numerical constants K1 , . . . , K4 .
(4) If, in addition, E[X] = 0, then E[exp(λX)] ≤ exp(K4 λ²σ²) for all |λ| ≤ K4'/σ.
In particular, if (2) holds with K2 = 1, then (4) holds with K4 = 2e² and K4' = 1/(2e).
The proof, which is similar to that for Theorem 4.1.11, is presented in Section 4.5.2.
While the concentration properties of sub-exponential random variables are not quite so nice
as those for sub-Gaussian random variables (recall Hoeffding’s inequality, Corollary 4.1.10), we
can give sharp tail bounds for sub-exponential random variables. We first give a simple bound on
deviation probabilities.
Proposition 4.1.16. Let X be a mean-zero (τ², b)-sub-exponential random variable. Then for all t ≥ 0,
P(X ≥ t) ∨ P(X ≤ −t) ≤ exp( −(1/2) min{ t²/τ², t/b } ).
Proof The proof is an application of the Chernoff bound technique; we prove only the upper tail
as the lower tail is similar. We have
P(X ≥ t) ≤ E[e^{λX}]/e^{λt} ≤(i) exp( λ²τ²/2 − λt ),
inequality (i) holding for |λ| ≤ 1/b. To minimize the last term in λ, we take λ = min{ t/τ², 1/b },
which gives the result.
Comparing with sub-Gaussian random variables, which have b = 0, we see that Proposition 4.1.16 gives a similar result for small t—essentially the same concentration as for sub-Gaussian random variables—while for large t, the tails decrease only exponentially in t.
We can also give a tensorization identity similar to Proposition 4.1.9.
Proof We apply an inductive technique similar to that used in the proof of Proposition 4.1.9.
First, for any fixed i, we know that if |λ| ≤ 1/(bi|ai|), then |aiλ| ≤ 1/bi and so
E[exp(λaiXi)] ≤ exp( λ²ai²σi²/2 ).
Now, we inductively apply the preceding inequality, which applies so long as |λ| ≤ 1/(bi|ai|) for all i. We have
E[ exp( λ Σ_{i=1}^n aiXi ) ] = Π_{i=1}^n E[exp(λaiXi)] ≤ Π_{i=1}^n exp( λ²ai²σi²/2 ),
It is instructive to study the structure of the bound of Corollary 4.1.18. Notably, the bound
is similar to the Hoeffding-type bound of Corollary 4.1.10 (holding for σ 2 -sub-Gaussian random
variables) that
P( Σ_{i=1}^n aiXi ≥ t ) ≤ exp( −t² / (2 ‖a‖₂² σ²) ),
so that for small t, Corollary 4.1.18 gives sub-Gaussian tail behavior. For large t, the bound is
weaker. However, in many cases, Corollary 4.1.18 can give finer control than naive sub-Gaussian
bounds. Indeed, suppose that the random variables Xi are i.i.d., mean zero, and satisfy Xi ∈ [−b, b]
with probability 1, but have variance σ 2 = E[Xi2 ] ≤ b2 as in Example 4.1.14. Then Corollary 4.1.18
implies that
P( Σ_{i=1}^n aiXi ≥ t ) ≤ exp( −(1/2) min{ (5/6) t²/(σ²‖a‖₂²), t/(2b‖a‖∞) } ).   (4.1.6)
When applied to a standard mean (and with the minor simplification that 1/3 < 5/12) with ai = 1/n, we obtain the bound that (1/n) Σ_{i=1}^n Xi ≤ t with probability at least 1 − exp( −n min{ t²/(3σ²), t/(4b) } ). Written differently, we take t = max{ σ√(3 log(1/δ)/n), 4b log(1/δ)/n } to obtain
(1/n) Σ_{i=1}^n Xi ≤ max{ σ√(3 log(1/δ))/√n, 4b log(1/δ)/n }   with probability 1 − δ.
The sharpest such bound possible via more naive Hoeffding-type bounds is b√(2 log(1/δ))/√n, which has substantially worse scaling.
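A quick numerical comparison (ours, with illustrative parameter choices) of the two high-probability bounds just derived makes the scaling difference concrete when the variance is much smaller than the square of the range.

```python
import numpy as np

# Bernstein-style vs. Hoeffding-style deviation bounds for the mean of n i.i.d. variables
# bounded by b with variance sigma^2 << b^2 (cf. the discussion around (4.1.6)).
n, b, sigma2, delta = 10_000, 1.0, 0.01, 0.01
log1d = np.log(1.0 / delta)

bernstein = max(np.sqrt(3.0 * sigma2 * log1d / n), 4.0 * b * log1d / n)
hoeffding = b * np.sqrt(2.0 * log1d / n)

print("Bernstein-type bound:", f"{bernstein:.5f}")   # scales with sigma, not b
print("Hoeffding-type bound:", f"{hoeffding:.5f}")
```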
Lemma 4.1.19. Let X be a random variable satisfying the Bernstein condition (4.1.7). Then
E[ e^{λ(X−µ)} ] ≤ exp( λ²σ² / (2(1 − b|λ|)) )   for |λ| ≤ 1/b.
Said differently, a random variable satisfying Condition (4.1.7) is (√2 σ, b/2)-sub-exponential.
Proof Without loss of generality we assume µ = 0. We expand the moment generating function
by noting that
E[e^{λX}] = 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k E[X^k]/k! ≤(i) 1 + λ²σ²/2 + (λ²σ²/2) Σ_{k=3}^∞ |λb|^{k−2} = 1 + (λ²σ²/2) · 1/[1 − b|λ|]_+
where inequality (i) used the Bernstein condition (4.1.7). Noting that 1 + x ≤ e^x gives the result.
As one final example, we return to Bennett’s inequality (4.1.5) from Example 4.1.14.
Proof We assume without loss of generality that E[X] = 0. Using the standard Chernoff bound argument coupled with inequality (4.1.5), we see that
P( Σ_{i=1}^n Xi ≥ t ) ≤ exp( Σ_{i=1}^n (σi²/b²)( e^{λb} − 1 − λb ) − λt ).
Letting h(t) = (1 + t) log(1 + t) − t as in the statement of the proposition and σ² = Σ_{i=1}^n σi², we minimize over λ ≥ 0, setting λ = (1/b) log(1 + bt/σ²). Substituting into our Chernoff bound application gives the proposition.
A slightly more intuitive writing of Bennett's inequality is to use averages, in which case for σ̄² = (1/n) Σ_{i=1}^n σi², the average of the variances,
P( (1/n) Σ_{i=1}^n Xi ≥ t ) ≤ exp( −(nσ̄²/b²) h( bt/σ̄² ) ).
That this is a norm is not completely trivial, though a few properties are immediate: clearly ‖aX‖_ψ = |a| ‖X‖_ψ, and we have ‖X‖_ψ = 0 if and only if X = 0 with probability 1. The key result is that ‖·‖_ψ is in fact convex, which then guarantees that it is a norm.
Proposition 4.1.21. The function ‖·‖_ψ is convex on the space of random variables.
Proof Because ψ is convex and non-decreasing, x 7→ ψ(|x|) is convex as well. (Convince yourself
of this.) Thus, its perspective transform pers(ψ)(t, |x|) := tψ(|x|/t) is jointly convex in both t ≥ 0
and x (see Appendix B.3.3). This joint convexity of pers(ψ) implies that for any random variables X0
and X1 and t0 , t1 ,
E[pers(ψ)(λt0 + (1 − λ)t1 , |λX0 + (1 − λ)X1 |)] ≤ λE[pers(ψ)(t0 , |X0 |)] + (1 − λ)E[pers(ψ)(t1 , |X1 |)].
We also have what we term the sub-Gaussian and sub-exponential norms, typically denoted by considering the functions
ψ_p(x) := exp(|x|^p) − 1.
These induce the Orlicz ψp -norms, as for p ≥ 1, these are convex (as they are the composition of the
increasing convex function exp(·) applied to the nonnegative convex function | · |p ). Theorem 4.1.11
shows that we have a natural sub-Gaussian norm
‖X‖_{ψ2} := inf{ t > 0 : E[exp(X²/t²)] ≤ 2 },   (4.1.10)
while Theorem 4.1.15 shows a natural sub-exponential norm (or Orlicz ψ1-norm)
‖X‖_{ψ1} := inf{ t > 0 : E[exp(|X|/t)] ≤ 2 }.   (4.1.11)
Many relationships follow immediately from the definitions (4.1.10) and (4.1.11). For example,
any sub-Gaussian random variable (whether or not it is mean zero) has a square that is sub-
exponential: ‖X²‖_{ψ1} = ‖X‖_{ψ2}².
(This is immediate by definition.) By tracing through the arguments in the proofs of Theo-
rems 4.1.11 and 4.1.15, we can also see that an alternative definition of the two norms could
be
sup_{k∈N} (1/√k) E[|X|^k]^{1/k}   and   sup_{k∈N} (1/k) E[|X|^k]^{1/k}
for the sub-Gaussian and sub-exponential norms kXkψ2 and kXkψ1 , respectively. They are all
equivalent.
dimension while preserving essential aspects of the dataset. This line of research begins with Indyk and Motwani [112], and continues through a variety of other works, including Indyk [111] and work on locality-sensitive hashing by Andoni et al. [6], among others. The original approach is due to Johnson and Lindenstrauss, who used the results in the study of Banach spaces [117]; our proof follows a standard argument.
The most specific variant of this problem is as follows: we have n points u1, . . . , un, and we would like to construct a mapping Φ : Rd → Rm, where m ≪ d, such that
‖Φui − Φuj‖₂ ∈ (1 ± ε) ‖ui − uj‖₂.
Depending on the norm chosen, this task may be impossible; for the Euclidean (ℓ₂) norm, however, such an embedding is easy to construct using Gaussian random variables and with m = O(ε^{−2} log n). This embedding is known as the Johnson-Lindenstrauss embedding. Note that this size m is independent of the dimension d, only depending on the number of points n.
Example 4.1.23 (Johnson-Lindenstrauss): Let the matrix Φ ∈ Rm×d be defined as follows:
Φij ∼iid N(0, 1/m),
and let Φi ∈ Rd denote the ith row of this matrix. We claim that
m ≥ (8/ε²)( 2 log n + log(1/δ) )   implies   ‖Φui − Φuj‖₂² ∈ (1 ± ε) ‖ui − uj‖₂²
for all pairs ui, uj with probability at least 1 − δ. In particular, m ≳ (log n)/ε² is sufficient to achieve accurate dimension reduction with high probability.
To see this, note that for any fixed vector u,
⟨Φi, u⟩/‖u‖₂ ∼ N(0, 1/m),   and   ‖Φu‖₂²/‖u‖₂² = Σ_{i=1}^m ⟨Φi, u/‖u‖₂⟩²
is a sum of independent scaled χ²-random variables. In particular, we have E[ ‖Φu/‖u‖₂‖₂² ] = 1, and using the χ²-concentration result of Example 4.1.13 yields
P( |‖Φu‖₂²/‖u‖₂² − 1| ≥ ε ) = P( m|‖Φu‖₂²/‖u‖₂² − 1| ≥ mε ) ≤ 2 inf_{|λ|≤1/4} exp( 2mλ² − λmε ) = 2 exp( −mε²/8 ),
the last inequality holding for ε ∈ [0, 1]. Now, using the union bound applied to each of the pairs (ui, uj) in the sample, we have
P( there exist i ≠ j s.t. |‖Φ(ui − uj)‖₂² − ‖ui − uj‖₂²| ≥ ε‖ui − uj‖₂² ) ≤ 2 · (n choose 2) · exp( −mε²/8 ).
Taking m ≥ (8/ε²) log(n²/δ) = (16/ε²) log n + (8/ε²) log(1/δ) yields that with probability at least 1 − δ, we have ‖Φui − Φuj‖₂² ∈ (1 ± ε) ‖ui − uj‖₂². 3
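Here is a small sketch (ours, not from the notes; the dimensions and tolerances are illustrative) of the Johnson-Lindenstrauss embedding of Example 4.1.23: a Gaussian matrix with roughly 8ε^{−2} log(n²/δ) rows, followed by an empirical check of the pairwise distortion.

```python
import numpy as np

# Johnson-Lindenstrauss sketch: project n points in R^d down to m dimensions using
# Phi with i.i.d. N(0, 1/m) entries, and check the pairwise squared-distance distortion.
rng = np.random.default_rng(4)
n, d, eps, delta = 50, 2_000, 0.25, 0.05
m = int(np.ceil(8.0 / eps**2 * np.log(n**2 / delta)))   # = (16/eps^2) log n + (8/eps^2) log(1/delta)

U = rng.normal(size=(n, d))
Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))   # entries have variance 1/m
V = U @ Phi.T                                           # embedded points, shape (n, m)

i, j = np.triu_indices(n, k=1)
orig = np.sum((U[i] - U[j]) ** 2, axis=1)
emb = np.sum((V[i] - V[j]) ** 2, axis=1)
ratios = emb / orig

print("embedding dimension m =", m)
print("distortion range:", round(ratios.min(), 3), "to", round(ratios.max(), 3))
print("all within (1 +/- eps):", bool(np.all(np.abs(ratios - 1.0) <= eps)))
```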
the maximum likelihood decoder. We now investigate how to choose a collection {x1 , . . . , xm }
of such codewords and give finite sample bounds on its probability of error. In fact, by using
concentration inequalities, we can show that a randomly drawn codebook of fairly small dimension
is likely to enjoy good performance.
Intuitively, if our codebook {x1, . . . , xm} ⊂ {0, 1}^d is well-separated, meaning that each pair of words xi, xk satisfies ‖xi − xk‖₁ ≥ cd for some numerical constant c > 0, we should be unlikely to make a mistake. Let us make this precise. We mistake word i for word k only if the received signal Z satisfies ‖Z − xi‖₁ ≥ ‖Z − xk‖₁, and letting J = {j ∈ [d] : xij ≠ xkj} denote the set of at least c · d indices where xi and xk differ, we have
‖Z − xi‖₁ ≥ ‖Z − xk‖₁   if and only if   Σ_{j∈J} ( |Zj − xij| − |Zj − xkj| ) ≥ 0.
If xi is the word being sent and xi and xk differ in position j, then |Zj − xij| − |Zj − xkj| ∈ {−1, 1}, and is equal to −1 with probability (1 − ε) and 1 with probability ε. That is, we have ‖Z − xi‖₁ ≥ ‖Z − xk‖₁ if and only if
Σ_{j∈J} ( |Zj − xij| − |Zj − xkj| ) + |J|(1 − 2ε) ≥ |J|(1 − 2ε) ≥ cd(1 − 2ε),
and the expectation E_Q[|Zj − xij| − |Zj − xkj| | xi] = −(1 − 2ε) when xij ≠ xkj. Using the Hoeffding bound, then, we have
P( ‖Z − xi‖₁ ≥ ‖Z − xk‖₁ ) ≤ exp( −(1/2)|J|(1 − 2ε)² ) ≤ exp( −(1/2)cd(1 − 2ε)² ),
where we have used that there are at least |J| ≥ cd indices differing between xi and xk. The probability of making a mistake at all is thus at most m exp(−(1/2)cd(1 − 2ε)²) if our codebook has separation c · d.
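To make this error calculation concrete, here is a small simulation (ours, not part of the notes; the codebook size, block length, and flip probability are illustrative) of minimum-Hamming-distance decoding of a random codebook over a binary symmetric channel, compared with the union-bound guarantee above.

```python
import numpy as np

# Random codebook over the binary symmetric channel: send x_i in {0,1}^d, flip each bit
# independently with probability eps, decode by minimum Hamming distance (the ML decoder).
rng = np.random.default_rng(5)
m, d, eps, trials = 64, 256, 0.1, 2_000

codebook = rng.integers(0, 2, size=(m, d))
# Minimum pairwise separation; random codewords differ in about d/2 coordinates.
sep = min(np.sum(codebook[i] != codebook[j]) for i in range(m) for j in range(i + 1, m))

errors = 0
for _ in range(trials):
    i = rng.integers(m)
    z = codebook[i] ^ (rng.random(d) < eps)             # received word after random flips
    decoded = np.argmin(np.sum(codebook != z, axis=1))  # minimum Hamming distance decoding
    errors += decoded != i
print(f"minimum separation {sep}/{d}, empirical error rate {errors / trials:.4f}")
print("union-bound guarantee:", m * np.exp(-0.5 * sep * (1 - 2 * eps) ** 2))
```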
For low error decoding to occur with extremely high probability, it is thus sufficient to choose
a set of code words {x1 , . . . , xm } that is well separated. To that end, we state a simple lemma.
Definition 4.3. Let M1 , M2 , . . . be an R-valued sequence of random variables. They are a martin-
gale if there exist another sequence of random variables {Z1 , Z2 , . . .} ⊂ Z and sequence of functions
fn : Z n → R such that
E[Mn | Z_1^{n−1}] = M_{n−1}   and   Mn = fn(Z_1^n)
for all n ∈ N. We say that the sequence Mn is adapted to {Zn }.
Definition 4.4. A sequence D1, D2, . . . adapted to {Zn} is a martingale difference sequence if E[Dn | Z_1^{n−1}] = 0 for all n ∈ N.
There are numerous examples of martingale sequences. The classical one is the symmetric
random walk.
Example 4.2.1: Let Dn ∈ {±1} be uniform and independent. Then the Dn form a martingale difference sequence adapted to themselves (that is, we may take Zn = Dn), and Mn = Σ_{i=1}^n Di is a martingale. 3
A more sophisticated example, to which we will frequently return and that suggests the potential
usefulness of martingale constructions, is the Doob martingale associated with a function f .
by the tower property of expectations. Thus, the $D_i$ satisfy Definition 4.4 of a martingale difference sequence, and moreover, we have
\[
\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)],
\]
and so the Doob martingale captures exactly the difference between $f$ and its expectation. 3
because $D_1, \ldots, D_{n-1}$ are functions of $Z_1^{n-1}$. Then we use Definition 4.5, which implies that $\mathbb{E}[e^{\lambda D_n} \mid Z_1^{n-1}] \le e^{\lambda^2 \sigma_n^2 / 2}$, and we obtain
\[
\mathbb{E}[\exp(\lambda M_n)] \le \mathbb{E}\left[ \prod_{i=1}^{n-1} e^{\lambda D_i} \right] \exp\left( \frac{\lambda^2 \sigma_n^2}{2} \right),
\]
as desired.

The second claims are simply applications of Chernoff bounds via Proposition 4.1.8 and the fact that $\mathbb{E}[M_n] = 0$.
Corollary 4.2.4. Let $D_i$ be a bounded difference martingale difference sequence, meaning that $|D_i| \le c$. Then $M_n = \sum_{i=1}^n D_i$ satisfies
\[
\mathbb{P}(n^{-1/2} M_n \ge t) \vee \mathbb{P}(n^{-1/2} M_n \le -t) \le \exp\left( - \frac{t^2}{2 c^2} \right)
\quad \mbox{for } t \ge 0.
\]
Thus, bounded random walks are (with high probability) within roughly $\pm \sqrt{n}$ of their expectations after $n$ steps.
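As a quick numerical illustration (a sketch, with arbitrary choices of n and t), one can compare the empirical tail of $n^{-1/2} M_n$ for a $\pm 1$ random walk (so $c = 1$) against the bound $\exp(-t^2/2)$:

\begin{verbatim}
# Sketch: empirical tail of a +/-1 random walk vs. the Azuma-Hoeffding bound.
import numpy as np

rng = np.random.default_rng(2)
n, reps, t = 1000, 5000, 1.5
walks = rng.choice([-1, 1], size=(reps, n)).sum(axis=1) / np.sqrt(n)
print("empirical P(n^{-1/2} M_n >= t):", np.mean(walks >= t))
print("Azuma-Hoeffding bound exp(-t^2/2):", np.exp(-t**2 / 2))
\end{verbatim}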
There exist extensions of these inequalities to the cases where we control the variance of the
martingales; see Freedman [87].
The classical inequality relating bounded differences and concentration is McDiarmid’s inequal-
ity, or the bounded differences inequality.
Proposition 4.2.5 (Bounded differences inequality). Let $f : \mathcal{X}^n \to \mathbb{R}$ satisfy bounded differences with constants $c_i$, and let $X_i$ be independent random variables. Then $f(X_1^n) - \mathbb{E}[f(X_1^n)]$ is $\frac{1}{4} \sum_{i=1}^n c_i^2$-sub-Gaussian, and
\[
\mathbb{P}\left( f(X_1^n) - \mathbb{E}[f(X_1^n)] \ge t \right) \vee \mathbb{P}\left( f(X_1^n) - \mathbb{E}[f(X_1^n)] \le -t \right)
\le \exp\left( - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \right).
\]
Proof The basic idea is to show that the Doob martingale (Example 4.2.2) associated with $f$ is $c_i^2/4$-sub-Gaussian, and then to simply apply the Azuma-Hoeffding inequality. To that end, define $D_i = \mathbb{E}[f(X_1^n) \mid X_1^i] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1}]$ as before, and note that $\sum_{i=1}^n D_i = f(X_1^n) - \mathbb{E}[f(X_1^n)]$. The random variables $D_i$ conditionally lie in intervals $[L_i, U_i]$ of length
\begin{align*}
U_i - L_i
& \le \sup_{x_1^{i-1}} \sup_{x, x'} \left( \mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x] - \mathbb{E}[f(X_1^n) \mid X_1^{i-1} = x_1^{i-1}, X_i = x'] \right) \\
& = \sup_{x_1^{i-1}} \sup_{x, x'} \int \left( f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n) \right) dP(x_{i+1}^n)
\le c_i,
\end{align*}
where we have used the independence of the $X_i$ and Definition 4.6 of bounded differences. Consequently, we have by Hoeffding's Lemma (Example 4.1.6) that $\mathbb{E}[e^{\lambda D_i} \mid X_1^{i-1}] \le \exp(\lambda^2 c_i^2 / 8)$, that is, the Doob martingale is $c_i^2/4$-sub-Gaussian.
A number of quantities satisfy the conditions of Proposition 4.2.5, and we give two examples
here; we will revisit them more later.
Example 4.2.6 (Bounded random vectors): Let B be a Banach space—a complete normed
vector space—with norm k·k. Let Xi be independent bounded random vectors in B satisfying
$\mathbb{E}[X_i] = 0$ and $\|X_i\| \le c$. We claim that the quantity
\[
f(X_1^n) := \bigg\| \frac{1}{n} \sum_{i=1}^n X_i \bigg\|
\]
satisfies bounded differences with constants $c_i = 2c/n$: indeed,
\[
|f(x_1^{i-1}, x, x_{i+1}^n) - f(x_1^{i-1}, x', x_{i+1}^n)| \le \frac{1}{n} \|x - x'\| \le \frac{2c}{n}.
\]
Consequently, if the $X_i$ are independent, we have
\[
\mathbb{P}\left( \bigg| \bigg\| \frac{1}{n} \sum_{i=1}^n X_i \bigg\| - \mathbb{E}\bigg\| \frac{1}{n} \sum_{i=1}^n X_i \bigg\| \bigg| \ge t \right) \le 2 \exp\left( - \frac{n t^2}{2 c^2} \right)
\tag{4.2.1}
\]
for all $t \ge 0$. That is, the norm of (bounded) random vectors in an essentially arbitrary vector space concentrates extremely quickly about its expectation.

The challenge then becomes to control the expectation term in the concentration bound (4.2.1). In certain cases—for example, when we have a Euclidean structure on the vectors $X_i$—this is easier. Indeed, let us specialize to the case that $X_i \in \mathcal{H}$, a (real) Hilbert space, so that there is an inner product $\langle \cdot, \cdot \rangle$ and the norm satisfies $\|x\|^2 = \langle x, x \rangle$ for $x \in \mathcal{H}$. Then Cauchy-Schwarz implies that
\[
\left( \mathbb{E}\bigg\| \sum_{i=1}^n X_i \bigg\| \right)^2
\le \mathbb{E}\bigg\| \sum_{i=1}^n X_i \bigg\|^2
= \sum_{i,j} \mathbb{E}[\langle X_i, X_j \rangle]
= \sum_{i=1}^n \mathbb{E}[\|X_i\|^2].
\]
That is, assuming the $X_i$ are independent and $\mathbb{E}[\|X_i\|^2] \le \sigma^2$, inequality (4.2.1) becomes
\[
\mathbb{P}\left( \big\| \overline{X}_n \big\| \ge \frac{\sigma}{\sqrt{n}} + t \right)
\le \mathbb{P}\left( \big\| \overline{X}_n \big\| \ge \mathbb{E}\big\| \overline{X}_n \big\| + t \right)
\le 2 \exp\left( - \frac{n t^2}{2 c^2} \right),
\]
where $\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. 3
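A minimal Monte Carlo sketch of inequality (4.2.1), under the illustrative assumption that the $X_i$ are uniformly random unit vectors in $\mathbb{R}^5$ (so $c = \sigma = 1$):

\begin{verbatim}
# Sketch: concentration of the norm of an average of bounded random vectors.
# Data model (random unit vectors) and parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, d, reps, t = 400, 5, 2000, 0.15
X = rng.normal(size=(reps, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)    # ||X_i|| = 1, E[X_i] = 0
norms = np.linalg.norm(X.mean(axis=1), axis=1)   # ||bar X_n|| per replication

print("E||bar X_n|| (Monte Carlo):", norms.mean())
print("sigma / sqrt(n)           :", 1 / np.sqrt(n))
print(f"empirical P(||bar X_n|| >= E + {t}):",
      np.mean(norms >= norms.mean() + t),
      " vs bound 2*exp(-n t^2 / (2 c^2)) =", 2 * np.exp(-n * t**2 / 2))
\end{verbatim}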
We can specialize Example 4.2.6 to a situation that is very important for treatments of concen-
tration, sums of random vectors, and generalization bounds in machine learning.
Example 4.2.7 (Rademacher complexities): This example is actually a special case of Ex-
ample 4.2.6, but its frequent uses justify a more specialized treatment and consideration. Let
$\mathcal{X}$ be some space, and let $\mathcal{F}$ be some collection of functions $f : \mathcal{X} \to \mathbb{R}$. Let $\varepsilon_i \in \{-1, 1\}$ be a collection of independent random signs. Then the empirical Rademacher complexity of $\mathcal{F}$ is
\[
\mathcal{R}_n(\mathcal{F} \mid x_1^n) := \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(x_i) \right],
\]
Consequently, the empirical Rademacher complexity is such that $\mathcal{R}_n(\mathcal{F} \mid X_1^n) - \mathcal{R}_n(\mathcal{F})$ is $\frac{(b_1 - b_0)^2}{4n}$-sub-Gaussian by Theorem 4.2.3. 3
These examples warrant more discussion, and it is possible to argue that many variants of these
random variables are well-concentrated. For example, instead of functions we may simply consider
an arbitrary set $A \subset \mathbb{R}^n$ and define the random variable
\[
Z(A) := \sup_{a \in A} \langle a, \varepsilon \rangle = \sup_{a \in A} \sum_{i=1}^n a_i \varepsilon_i.
\]
As a function of the random signs $\varepsilon_i$, we may write $Z(A) = f(\varepsilon)$, and this is then a function satisfying $|f(\varepsilon) - f(\varepsilon')| \le \sup_{a \in A} |\langle a, \varepsilon - \varepsilon' \rangle|$, so that if $\varepsilon$ and $\varepsilon'$ differ in index $i$, we have $|f(\varepsilon) - f(\varepsilon')| \le 2 \sup_{a \in A} |a_i|$. That is, $Z(A) - \mathbb{E}[Z(A)]$ is $\sum_{i=1}^n \sup_{a \in A} |a_i|^2$-sub-Gaussian.
P
as elements of this vector space L. (Here we have used 1Xi to denote the point mass at Xi .)
Then the Rademacher complexity is nothing more than the expected norm of Pn0 , a random
vector, as in Example 4.2.6. This view is somewhat sophisticated, but it shows that any general
results we may prove about random vectors, as in Example 4.2.6, will carry over immediately
to versions of the Rademacher complexity. 3
denote the empirical distribution on {Xi }ni=1 , where 1Xi denotes the point mass at Xi . Then for
functions f : X → R (or more generally, any function f defined on X ), we let
\[
P_n f := \mathbb{E}_{P_n}[f(X)] = \frac{1}{n} \sum_{i=1}^n f(X_i)
\]
denote the empirical expectation of $f$ evaluated on the sample, and we also let
\[
P f := \mathbb{E}_P[f(X)] = \int f(x) \, dP(x)
\]
denote general expectations under a measure $P$. With this notation, we study uniform laws of large numbers, which consist of proving results of the form
\[
\|P_n - P\|_{\mathcal{F}} \to 0,
\]
where convergence is in probability, in expectation, almost surely, or with rates of convergence. Here we view $P_n$ and $P$ as (infinite-dimensional) vectors on the space of maps from $\mathcal{F}$ to $\mathbb{R}$, and we define the (semi)norm $\|\cdot\|_{\mathcal{F}}$ of any $L : \mathcal{F} \to \mathbb{R}$ by
\[
\|L\|_{\mathcal{F}} := \sup_{f \in \mathcal{F}} |L(f)|.
\]
Thus, roughly, we are simply asking questions about when random vectors converge to their expectations.$^1$
The starting point of this investigation considers bounded random functions, that is, F consists
of functions f : X → [a, b] for some −∞ < a ≤ b < ∞. In this case, the bounded differences
inequality (Proposition 4.2.5) immediately implies that expectations of kPn − P kF provide strong
guarantees on concentration of kPn − P kF .
1
Some readers may worry about measurability issues here. All of our applications will be in separable spaces,
so that we may take suprema with abandon without worrying about measurability, and consequently we ignore this
from now on.
by the triangle inequality. An entirely parallel argument gives the converse lower bound of $-\frac{b - a}{n}$, and thus Proposition 4.2.5 gives the result.
Proposition 4.3.1 shows that, to provide control over high-probability concentration of kPn − P kF ,
it is (at least in cases where F is bounded) sufficient to control the expectation E[kPn − P kF ]. We
take this approach through the remainder of this section, developing tools to simplify bounding
this quantity.
Our starting points consist of a few inequalities relating expectations to symmetrized quantities,
which are frequently easier to control than their non-symmetrized parts. This symmetrization
technique is widely used in probability theory, theoretical statistics, and machine learning. The key
is that for centered random variables, symmetrized quantities have, to within numerical constants,
similar expectations to their non-symmetrized counterparts. Thus, in many cases, it is equivalent
to analyze the symmetrized quantity and the initial quantity.
Proposition 4.3.2. Let $X_i$ be independent random vectors on a (Banach) space with norm $\|\cdot\|$ and let $\varepsilon_i \in \{-1, 1\}$ be independent random signs. Then for any $p \ge 1$,
\[
2^{-p} \, \mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i (X_i - \mathbb{E}[X_i]) \bigg\|^p \right]
\le \mathbb{E}\left[ \bigg\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \bigg\|^p \right]
\le 2^p \, \mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i X_i \bigg\|^p \right].
\]
In the proof of the upper bound, we could also show the bound
\[
\mathbb{E}\left[ \bigg\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \bigg\|^p \right]
\le 2^p \, \mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i (X_i - \mathbb{E}[X_i]) \bigg\|^p \right],
\]
Now, note that the distribution of $X_i - X_i'$ is symmetric, so that $X_i - X_i' \stackrel{\rm dist}{=} \varepsilon_i (X_i - X_i')$, and thus
\[
\mathbb{E}\left[ \bigg\| \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \bigg\|^p \right]
\le \mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i (X_i - X_i') \bigg\|^p \right].
\]
as desired.

For the left bound in the proposition, let $Y_i = X_i - \mathbb{E}[X_i]$ be the centered version of the random variables. We break the sum over random variables into two parts, conditional on whether $\varepsilon_i = \pm 1$, using repeated conditioning. We have
\begin{align*}
\mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i Y_i \bigg\|^p \right]
& = \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = 1} Y_i - \sum_{i : \varepsilon_i = -1} Y_i \bigg\|^p \right] \\
& \le \mathbb{E}\left[ 2^{p-1} \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = 1} Y_i \bigg\|^p \,\Big|\, \varepsilon \right]
+ 2^{p-1} \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = -1} Y_i \bigg\|^p \,\Big|\, \varepsilon \right] \right] \\
& = 2^{p-1} \mathbb{E}\left[ \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = 1} Y_i + \sum_{i : \varepsilon_i = -1} \mathbb{E}[Y_i] \bigg\|^p \,\Big|\, \varepsilon \right]
+ \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = -1} Y_i + \sum_{i : \varepsilon_i = 1} \mathbb{E}[Y_i] \bigg\|^p \,\Big|\, \varepsilon \right] \right] \\
& \le 2^{p-1} \mathbb{E}\left[ \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = 1} Y_i + \sum_{i : \varepsilon_i = -1} Y_i \bigg\|^p \,\Big|\, \varepsilon \right]
+ \mathbb{E}\left[ \bigg\| \sum_{i : \varepsilon_i = -1} Y_i + \sum_{i : \varepsilon_i = 1} Y_i \bigg\|^p \,\Big|\, \varepsilon \right] \right] \\
& = 2^p \, \mathbb{E}\left[ \bigg\| \sum_{i=1}^n Y_i \bigg\|^p \right].
\end{align*}
The expectation of $\|P_n^0\|_{\mathcal{F}}$ is of course the Rademacher complexity (Examples 4.2.7 and 4.2.8), and we have the following corollary.

From Corollary 4.3.3, it is evident that by controlling the expectation of the symmetrized process $\mathbb{E}[\|P_n^0\|_{\mathcal{F}}]$ we can derive concentration inequalities and uniform laws of large numbers. For example, we immediately obtain that
\[
\mathbb{P}\left( \|P_n - P\|_{\mathcal{F}} \ge 2 \mathbb{E}[\|P_n^0\|_{\mathcal{F}}] + t \right) \le \exp\left( - \frac{2 n t^2}{(b - a)^2} \right)
\]
A refinement of Massart’s finite class bound applies when the classes are infinite but, on a
collection X1 , . . . , Xn , the functions f ∈ F may take on only a (smaller) number of values. In this
case, we define the empirical shatter coefficient of a collection of points x1 , . . . , xn by SF (xn1 ) :=
card{(f (x1 ), . . . , f (xn )) | f ∈ F }, the number of distinct vectors of values (f (x1 ), . . . , f (xn )) the
functions f ∈ F may take. The shatter coefficient is the maximum of the empirical shatter coeffi-
cients over xn1 ∈ X n , that is, SF (n) := supxn1 SF (xn1 ). It is clear that SF (n) ≤ |F| always, but by
only counting distinct values, we have the following corollary.
Corollary 4.3.5 (A sharper variant of Massart's finite class bound). Let $\mathcal{F}$ be any collection of functions with $f : \mathcal{X} \to \mathbb{R}$, and assume that $\sigma_n^2 := n^{-1} \mathbb{E}[\max_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)^2] < \infty$. Then
\[
\mathcal{R}_n(\mathcal{F}) \le \frac{\sqrt{2 \sigma_n^2 \log S_{\mathcal{F}}(n)}}{\sqrt{n}}.
\]
Typical classes with small shatter coefficients include Vapnik-Chervonenkis classes of functions; we
do not discuss these further here, instead referring to one of the many books in machine learning
and empirical process theory in statistics.
The most important of the calculus rules we use are the comparison inequalities for Rademacher sums, which allow us to consider compositions of function classes and maintain small complexity measures. We state the rule here; the proof is complex, so we defer it to Section 4.5.3.

Theorem 4.3.6 (Ledoux-Talagrand Contraction). Let $T \subset \mathbb{R}^n$ be an arbitrary set and let $\phi_i : \mathbb{R} \to \mathbb{R}$ be 1-Lipschitz and satisfy $\phi_i(0) = 0$. Then for any nondecreasing convex function $\Phi : \mathbb{R} \to \mathbb{R}_+$,
\[
\mathbb{E}\left[ \Phi\left( \frac{1}{2} \sup_{t \in T} \sum_{i=1}^n \phi_i(t_i) \varepsilon_i \right) \right]
\le \mathbb{E}\left[ \Phi\left( \sup_{t \in T} \langle t, \varepsilon \rangle \right) \right].
\]
Figures 4.1 and 4.2 give examples of (respectively) a covering and a packing of the same set.
An exercise in proof by contradiction shows that the packing and covering numbers of a set are
in fact closely related:
Lemma 4.3.8. The packing and covering numbers satisfy the following inequalities:
M (2δ, Θ, ρ) ≤ N (δ, Θ, ρ) ≤ M (δ, Θ, ρ).
We leave derivation of this lemma to Exercise 4.11, noting that it shows that (up to constant factors)
packing and covering numbers have the same scaling in the radius $\delta$. As a simple example, we see for any interval $[a, b]$ on the real line that in the usual absolute distance metric, $N(\delta, [a, b], |\cdot|) \asymp (b - a)/\delta$.
As one example of the metric entropy, consider a set of functions F with reasonable covering
numbers (metric entropy) in k·k∞ -norm.
Example 4.3.9 (The "standard" covering number guarantee): Let $\mathcal{F}$ consist of functions $f : \mathcal{X} \to [-b, b]$ and let the metric $\rho$ be $\|f - g\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - g(x)|$. Then
\[
\mathbb{P}\left( \sup_{f \in \mathcal{F}} |P_n f - P f| \ge t \right)
\le \exp\left( - \frac{n t^2}{18 b^2} + \log N(t/3, \mathcal{F}, \|\cdot\|_\infty) \right).
\tag{4.3.2}
\]
So as long as the covering numbers $N(t, \mathcal{F}, \|\cdot\|_\infty)$ grow sub-exponentially in $1/t$—so that $\log N(t/3, \mathcal{F}, \|\cdot\|_\infty) \ll n t^2$—we have the (essentially) sub-Gaussian tail bound (4.3.2). Example 4.4.11 gives one typical case. Indeed, fix a minimal $t/3$-cover of $\mathcal{F}$ in $\|\cdot\|_\infty$ of size $N := N(t/3, \mathcal{F}, \|\cdot\|_\infty)$, calling the covering functions $f_1, \ldots, f_N$. Then for any $f \in \mathcal{F}$ and the function $f_i$ satisfying $\|f - f_i\|_\infty \le t/3$, we have
\[
|P_n f - P f| \le |P_n f - P_n f_i| + |P_n f_i - P f_i| + |P f_i - P f| \le |P_n f_i - P f_i| + \frac{2t}{3}.
\]
The Azuma-Hoeffding inequality (Theorem 4.2.3) guarantees (by a union bound) that
\[
\mathbb{P}\left( \max_{i \le N} |P_n f_i - P f_i| \ge t \right) \le \exp\left( - \frac{n t^2}{2 b^2} + \log N \right).
\]
Combine this bound (replacing $t$ with $t/3$) to obtain inequality (4.3.2). 3
Given the relationships between packing, covering, and size of sets Θ, we would expect there
to be relationships between volume, packing, and covering numbers. This is indeed the case, as we
now demonstrate for arbitrary norm balls in finite dimensions.
Lemma 4.3.10. Let $B$ denote the unit $\|\cdot\|$-ball in $\mathbb{R}^d$. Then
\[
\left( \frac{1}{\delta} \right)^d \le N(\delta, B, \|\cdot\|) \le \left( 1 + \frac{2}{\delta} \right)^d.
\]
Proof We prove the lemma via a volumetric argument. For the lower bound, note that if the points $v_1, \ldots, v_N$ are a $\delta$-cover of $B$, then
\[
{\rm Vol}(B) \le \sum_{i=1}^N {\rm Vol}(\delta B + v_i) = N \, {\rm Vol}(\delta B) = N \, {\rm Vol}(B) \, \delta^d.
\]
In particular, $N \ge \delta^{-d}$. For the upper bound on $N(\delta, B, \|\cdot\|)$, let $\mathcal{V}$ be a $\delta$-packing of $B$ with maximal cardinality, so that $|\mathcal{V}| = M(\delta, B, \|\cdot\|) \ge N(\delta, B, \|\cdot\|)$ (recall Lemma 4.3.8). Notably, the collection of $\delta$-balls $\{\delta B + v_i\}_{i=1}^M$ covers the ball $B$ (as otherwise, we could put an additional element in the packing $\mathcal{V}$), and moreover, the balls $\{\frac{\delta}{2} B + v_i\}$ are all disjoint by definition of a packing. Consequently, we find that
\[
M \left( \frac{\delta}{2} \right)^d {\rm Vol}(B) = M \, {\rm Vol}\left( \frac{\delta}{2} B \right)
\le {\rm Vol}\left( B + \frac{\delta}{2} B \right) = \left( 1 + \frac{\delta}{2} \right)^d {\rm Vol}(B).
\]
Rewriting, we obtain
\[
M(\delta, B, \|\cdot\|) \le \frac{(1 + \delta/2)^d \, {\rm Vol}(B)}{(\delta/2)^d \, {\rm Vol}(B)} = \left( 1 + \frac{2}{\delta} \right)^d,
\]
completing the proof.
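One can get a feel for Lemma 4.3.10 with a crude greedy construction (a sketch only; the dimension, radius, and sample size below are arbitrary choices): sample points of the unit Euclidean ball and keep any point farther than $\delta$ from every point kept so far. The kept points form a $\delta$-packing whose centers $\delta$-cover the sampled points, so their number sits between the two volumetric bounds.

\begin{verbatim}
# Sketch: greedy delta-cover/packing of the Euclidean unit ball vs. the
# volumetric bounds of Lemma 4.3.10.  Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
d, delta, n_pts = 3, 0.5, 20000
pts = rng.normal(size=(n_pts, d))
pts *= rng.random((n_pts, 1)) ** (1 / d) / np.linalg.norm(pts, axis=1, keepdims=True)

centers = []
for x in pts:                       # greedy maximal packing of the sample
    if all(np.linalg.norm(x - c) > delta for c in centers):
        centers.append(x)

print(f"greedy cover size ~ {len(centers)}, "
      f"upper bound (1 + 2/delta)^d = {(1 + 2/delta)**d:.0f}, "
      f"lower bound (1/delta)^d = {(1/delta)**d:.0f}")
\end{verbatim}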
Let us give one application of Lemma 4.3.10 to concentration of random matrices; we explore
more in the exercises as well. We can generalize the definition of sub-Gaussian random variables
to sub-Gaussian random vectors, where we say that X ∈ Rd is a σ 2 -sub-Gaussian vector if
\[
\mathbb{E}[\exp(\langle u, X - \mathbb{E}[X] \rangle)] \le \exp\left( \frac{\sigma^2 \|u\|_2^2}{2} \right)
\tag{4.3.3}
\]
for all u ∈ Rd . For example, X ∼ N(0, Id ) is immediately 1-sub-Gaussian, and X ∈ [−b, b]d with
independent entries is b2 -sub-Gaussian. Now, suppose that Xi are independent isotropic random
vectors, meaning that E[Xi ] = 0, E[Xi Xi> ] = Id , and that they are also σ 2 -sub-Gaussian. Then by
an application of Lemma 4.3.10, we can give concentration guarantees for the sample covariance
$\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ for the operator norm $\|A\|_{\rm op} := \sup\{\langle u, A v \rangle \mid \|u\|_2 = \|v\|_2 = 1\}$.

Proposition 4.3.11. Let $X_i$ be independent isotropic and $\sigma^2$-sub-Gaussian vectors. Then there is a numerical constant $C$ such that with probability at least $1 - \delta$, the sample covariance $\Sigma_n := \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$ satisfies
\[
\|\Sigma_n - I_d\|_{\rm op} \le C \sigma^2 \left( \sqrt{\frac{d + \log \frac{1}{\delta}}{n}} + \frac{d + \log \frac{1}{\delta}}{n} \right).
\]
Proof The second inequality is trivial. Fix any $u \in \mathbb{B}_2^d$. Then for the $i$ such that $\|u - u_i\|_2 \le \epsilon$, we have
\[
\langle u, A u \rangle = \langle u - u_i, A u \rangle + \langle u_i, A (u - u_i) \rangle + \langle u_i, A u_i \rangle
\le 2 \epsilon \|A\|_{\rm op} + \langle u_i, A u_i \rangle
\]
by definition of the operator norm. Taking a supremum over $u$ gives the final result.

Let the matrix $E_i = X_i X_i^\top - I$, and define the average error $\overline{E}_n = \frac{1}{n} \sum_{i=1}^n E_i$. Then with this lemma in hand, we see that for any $\epsilon$-cover $\mathcal{N}$ of the $\ell_2$-ball $\mathbb{B}_2^d$,
\[
(1 - 2\epsilon) \|\overline{E}_n\|_{\rm op} \le \max_{u \in \mathcal{N}} \langle u, \overline{E}_n u \rangle.
\]
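A short simulation sketch of Proposition 4.3.11 (with standard Gaussian data, so $\sigma^2 = 1$, and arbitrary choices of $d$ and $n$): the operator-norm error of the sample covariance should scale roughly as $\sqrt{d/n}$ when $d \le n$.

\begin{verbatim}
# Sketch: operator-norm error of the sample covariance for isotropic
# Gaussian vectors.  Dimensions and sample sizes are illustrative.
import numpy as np

rng = np.random.default_rng(5)
d = 50
for n in [200, 800, 3200]:
    X = rng.normal(size=(n, d))                    # isotropic, 1-sub-Gaussian
    Sigma_n = X.T @ X / n
    err = np.linalg.norm(Sigma_n - np.eye(d), 2)   # largest singular value
    print(f"n = {n:5d}: ||Sigma_n - I||_op = {err:.3f}, "
          f"sqrt(d/n) = {np.sqrt(d / n):.3f}")
\end{verbatim}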
In general, however, we only have access to the risk via the empirical distribution of the $Z_i$, and we often choose $f$ by minimizing the empirical risk
\[
\widehat{L}_n(f) := \frac{1}{n} \sum_{i=1}^n \ell(f, Z_i).
\tag{4.4.2}
\]
As written, this formulation is quite abstract, so we provide a few examples to make it somewhat
more concrete.
Example 4.4.1 (Binary classification problems): One standard problem—still abstract—
that motivates the formulation (4.4.1) is the binary classification problem. Here the data Zi
come in pairs (X, Y ), where X ∈ X is some set of covariates (independent variables) and
Y ∈ {−1, 1} is the label of example X. The function class F consists of functions f : X → R,
and the goal is to find a function $f$ such that
\[
\mathbb{P}\left( \mathrm{sign}(f(X)) \neq Y \right)
\]
is small, that is, minimizing the risk $\mathbb{E}[\ell(f, Z)]$ where the loss is the 0-1 loss, $\ell(f, (x, y)) = 1\{f(x) y \le 0\}$. 3

In this case, the loss function is the zero-one loss $\ell(f, (x, y)) = 1\{\max_{l \neq y} f_l(x) \ge f_y(x)\}$. 3
Example 4.4.3 (Binary classification with linear functions): In the standard statistical
learning setting, the data x belong to Rd , and we assume that our function class F is indexed
by a set Θ ⊂ Rd , so that F = {fθ : fθ (x) = θ> x, θ ∈ Θ}. In this case, we may use the zero-one
loss, the convex hinge loss, or the (convex) logistic loss, which are variously $\ell_{\rm zo}(f_\theta, (x, y)) := 1\{y \theta^\top x \le 0\}$ and the convex losses
\[
\ell_{\rm hinge}(f_\theta, (x, y)) = \left[ 1 - y x^\top \theta \right]_+
\qquad \mbox{and} \qquad
\ell_{\rm logit}(f_\theta, (x, y)) = \log(1 + \exp(-y x^\top \theta)).
\]
The hinge and logistic losses, as they are convex, are substantially computationally easier to
work with, and they are common choices in applications. 3
The main motivating question that we ask is the following: given a sample Z1 , . . . , Zn , if we
choose some fbn ∈ F based on this sample, can we guarantee that it generalizes to unseen data? In
particular, can we guarantee that (with high probability) we have the empirical risk bound
\[
\widehat{L}_n(\widehat{f}_n) = \frac{1}{n} \sum_{i=1}^n \ell(\widehat{f}_n, Z_i) \le L(\widehat{f}_n) + \epsilon
\tag{4.4.3}
\]
for some small $\epsilon$? If we allow $\widehat{f}_n$ to be arbitrary, then this clearly becomes impossible: consider the classification example 4.4.1, and set $\widehat{f}_n$ to be the "hash" function that sets $\widehat{f}_n(x) = y$ if the pair $(x, y)$ was in the sample, and otherwise $\widehat{f}_n(x) = -1$. Then clearly $\widehat{L}_n(\widehat{f}_n) = 0$, while there is no useful bound on $L(\widehat{f}_n)$.
for all $f \in \mathcal{F}$. (Recall that the risk functional $L(f) = \mathbb{E}_P[\ell(f, Z)]$.) For example, if the loss is the zero-one loss from classification problems, inequality (4.4.4) is satisfied with $\sigma^2 = \frac{1}{4}$ by Hoeffding's lemma. In order to guarantee a bound of the form (4.4.4) for a function $\widehat{f}$ chosen dependent on the data, in this section we give uniform bounds, that is, we would like to bound
\[
\mathbb{P}\left( \mbox{there exists } f \in \mathcal{F} \mbox{ s.t. } L(f) > \widehat{L}_n(f) + t \right)
\quad \mbox{or} \quad
\mathbb{P}\left( \sup_{f \in \mathcal{F}} \left( \widehat{L}_n(f) - L(f) \right) > t \right).
\]
Such uniform bounds are certainly sufficient to guarantee that the empirical risk $\widehat{L}_n$ is a good proxy for the true risk $L$, even when $\widehat{f}_n$ is chosen based on the data.
Now, recalling that our set of functions or predictors $\mathcal{F}$ is finite or countable, let us suppose that for each $f \in \mathcal{F}$ we have a complexity measure $c(f)$—a penalty—such that
\[
\sum_{f \in \mathcal{F}} e^{-c(f)} \le 1.
\tag{4.4.5}
\]
This inequality should look familiar: it parallels the Kraft inequality from coding theory, which we will see in the coming chapters. As soon as we have such a penalty function, we have the following result.
Theorem 4.4.4. Let the loss $\ell$, distribution $P$ on $\mathcal{Z}$, and function class $\mathcal{F}$ be such that $\ell(f, Z)$ is $\sigma^2$-sub-Gaussian for each $f \in \mathcal{F}$, and assume that the complexity inequality (4.4.5) holds. Then with probability at least $1 - \delta$ over the sample $Z_{1:n}$,
\[
L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2 \sigma^2 \left( \log \frac{1}{\delta} + c(f) \right)}{n}}
\quad \mbox{for all } f \in \mathcal{F}.
\]
Proof First, we note that by the usual sub-Gaussian concentration inequality (Corollary 4.1.10) we have for any $t \ge 0$ and any $f \in \mathcal{F}$ that
\[
\mathbb{P}\left( L(f) \ge \widehat{L}_n(f) + t \right) \le \exp\left( - \frac{n t^2}{2 \sigma^2} \right).
\]
Now, if we replace $t$ by $\sqrt{t^2 + 2 \sigma^2 c(f)/n}$, we obtain
\[
\mathbb{P}\left( L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n} \right)
\le \exp\left( - \frac{n t^2}{2 \sigma^2} - c(f) \right).
\]
Then using a union bound, we have
\[
\mathbb{P}\left( \exists \, f \in \mathcal{F} \mbox{ s.t. } L(f) \ge \widehat{L}_n(f) + \sqrt{t^2 + 2 \sigma^2 c(f)/n} \right)
\le \sum_{f \in \mathcal{F}} \exp\left( - \frac{n t^2}{2 \sigma^2} - c(f) \right)
= \exp\left( - \frac{n t^2}{2 \sigma^2} \right) \underbrace{\sum_{f \in \mathcal{F}} \exp(-c(f))}_{\le 1}.
\]

As one classical example of this setting, suppose that we have a finite class of functions $\mathcal{F}$. Then we can set $c(f) = \log |\mathcal{F}|$, in which case we clearly have the summation guarantee (4.4.5), and we obtain
\[
L(f) \le \widehat{L}_n(f) + \sqrt{\frac{2 \sigma^2 \left( \log \frac{1}{\delta} + \log |\mathcal{F}| \right)}{n}}
\quad \mbox{uniformly for } f \in \mathcal{F}
\]
with probability at least $1 - \delta$. To make this even more concrete, consider the following example.
Example 4.4.5 (Floating point classifiers): We implement a linear binary classifier using double-precision floating point values, that is, we have $f_\theta(x) = \theta^\top x$ for all $\theta \in \mathbb{R}^d$ that may be represented using $d$ double-precision floating point numbers. Then for each coordinate of $\theta$, there are at most $2^{64}$ representable numbers; in total, we must thus have $|\mathcal{F}| \le 2^{64 d}$. Thus, for the zero-one loss $\ell_{\rm zo}(f_\theta, (x, y)) = 1\{\theta^\top x y \le 0\}$, we have
\[
L(f_\theta) \le \widehat{L}_n(f_\theta) + \sqrt{\frac{\log \frac{1}{\delta} + 45 d}{2 n}}
\]
for all representable classifiers simultaneously, with probability at least $1 - \delta$, as the zero-one loss is $1/4$-sub-Gaussian. (Here we have used that $64 \log 2 < 45$.) 3
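A small sketch checking the finite-class guarantee numerically (the data-generating model and the finite class of random linear classifiers below are arbitrary illustrations): with the zero-one loss, $\sigma^2 = 1/4$, so the one-sided uniform bound is $\sqrt{(\log(1/\delta) + \log|\mathcal{F}|)/(2n)}$.

\begin{verbatim}
# Sketch: empirical check of the finite-class generalization bound.
# The classifiers and data distribution are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
d, n, num_f, delta, n_test = 5, 500, 200, 0.05, 50000
thetas = rng.normal(size=(num_f, d))             # the finite class F
theta_star = rng.normal(size=d)

def draw(n_samples):
    X = rng.normal(size=(n_samples, d))
    y = np.sign(X @ theta_star + 0.5 * rng.normal(size=n_samples))
    return X, y

Xtr, ytr = draw(n)
Xte, yte = draw(n_test)                          # proxy for the population
emp = np.mean((Xtr @ thetas.T) * ytr[:, None] <= 0, axis=0)   # hat L_n(f)
pop = np.mean((Xte @ thetas.T) * yte[:, None] <= 0, axis=0)   # approx L(f)
gap = np.max(pop - emp)
bound = np.sqrt((np.log(1 / delta) + np.log(num_f)) / (2 * n))
print(f"max_f L(f) - hat L_n(f) = {gap:.3f}, bound = {bound:.3f}")
\end{verbatim}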
We also note in passing that by replacing $\delta$ with $\delta/2$ in the bounds of Theorem 4.4.4, a union bound yields the following two-sided corollary.

Corollary 4.4.6. Under the conditions of Theorem 4.4.4, we have
\[
\left| \widehat{L}_n(f) - L(f) \right| \le \sqrt{\frac{2 \sigma^2 \left( \log \frac{2}{\delta} + c(f) \right)}{n}}
\quad \mbox{for all } f \in \mathcal{F}
\]
with probability at least $1 - \delta$.
Example 4.4.7 (Rademacher complexity of the $\ell_2$-ball): Let $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le r\}$, and consider the class of linear functionals $\mathcal{F} := \{f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. Then
\[
\mathcal{R}_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
because we have
\[
\mathcal{R}_n(\mathcal{F} \mid x_1^n) = \mathbb{E}\left[ \frac{r}{n} \bigg\| \sum_{i=1}^n \varepsilon_i x_i \bigg\|_2 \right]
\le \frac{r}{n} \sqrt{ \mathbb{E}\bigg\| \sum_{i=1}^n \varepsilon_i x_i \bigg\|_2^2 }
= \frac{r}{n} \sqrt{\sum_{i=1}^n \|x_i\|_2^2},
\]
as desired. 3
Example 4.4.8 (Rademacher complexity of the $\ell_1$-ball): In contrast to the previous example, suppose that $\Theta = \{\theta \in \mathbb{R}^d \mid \|\theta\|_1 \le r\}$, and consider the linear class $\mathcal{F} := \{f_\theta(x) = \theta^\top x, \theta \in \Theta\}$. Then
\[
\mathcal{R}_n(\mathcal{F} \mid x_1^n) = \frac{r}{n} \mathbb{E}\left[ \bigg\| \sum_{i=1}^n \varepsilon_i x_i \bigg\|_\infty \right].
\]
Now, each coordinate $j$ of $\sum_{i=1}^n \varepsilon_i x_i$ is $\sum_{i=1}^n x_{ij}^2$-sub-Gaussian, and thus using that $\mathbb{E}[\max_{j \le d} Z_j] \le \sqrt{2 \sigma^2 \log d}$ for arbitrary $\sigma^2$-sub-Gaussian $Z_j$ (see Exercise 4.7), we have
\[
\mathcal{R}_n(\mathcal{F} \mid x_1^n) \le \frac{r}{n} \sqrt{2 \log(2 d) \max_j \sum_{i=1}^n x_{ij}^2}.
\]
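The two bounds above are easy to check by Monte Carlo (a sketch; the data matrix and the parameters are arbitrary illustrations): estimate the empirical Rademacher complexities directly and compare with $(r/n)\sqrt{\sum_i \|x_i\|_2^2}$ and $(r/n)\sqrt{2\log(2d)\max_j \sum_i x_{ij}^2}$.

\begin{verbatim}
# Sketch: Monte Carlo estimates of the Rademacher complexities of the
# l2- and l1-ball linear classes vs. the bounds in Examples 4.4.7-4.4.8.
import numpy as np

rng = np.random.default_rng(7)
n, d, r, reps = 200, 30, 1.0, 4000
X = rng.normal(size=(n, d))
eps = rng.choice([-1.0, 1.0], size=(reps, n))
S = eps @ X                                   # each row is sum_i eps_i x_i

R2_mc = r / n * np.mean(np.linalg.norm(S, 2, axis=1))
R1_mc = r / n * np.mean(np.max(np.abs(S), axis=1))
R2_bd = r / n * np.sqrt(np.sum(X ** 2))
R1_bd = r / n * np.sqrt(2 * np.log(2 * d) * np.max(np.sum(X ** 2, axis=0)))
print(f"l2-ball: Monte Carlo {R2_mc:.4f} <= bound {R2_bd:.4f}")
print(f"l1-ball: Monte Carlo {R1_mc:.4f} <= bound {R1_bd:.4f}")
\end{verbatim}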
These examples are sufficient to derive a few sophisticated risk bounds. We focus on the case
where we have a loss function applied to some class with reasonable Rademacher complexity, in
which case it is possible to recenter the loss class and achieve reasonable complexity bounds. The
coming proposition does precisely this in the case of margin-based binary classification. Consider
points (x, y) ∈ X × {±1}, and let F be an arbitrary class of functions f : X → R and L =
{(x, y) 7→ `(yf (x))}f ∈F be the induced collection of losses. As a typical example, we might have
`(t) = [1 − t]+ , `(t) = e−t , or `(t) = log(1 + e−t ). We have the following proposition.
Proposition 4.4.9. Let $\mathcal{F}$ and $\mathcal{X}$ be such that $\sup_{x \in \mathcal{X}} |f(x)| \le M$ for $f \in \mathcal{F}$ and assume that $\ell$ is $L$-Lipschitz. Define the empirical and population risks $\widehat{L}_n(f) := P_n \ell(Y f(X))$ and $L(f) := P \ell(Y f(X))$. Then
\[
\mathbb{P}\left( \sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 4 L \mathcal{R}_n(\mathcal{F}) + t \right)
\le 2 \exp\left( - \frac{n t^2}{2 L^2 M^2} \right)
\quad \mbox{for } t \ge 0.
\]
Proof We may recenter the class $\mathcal{L}$, that is, replace $\ell(\cdot)$ with $\ell(\cdot) - \ell(0)$, without changing $\widehat{L}_n(f) - L(f)$. Call this class $\mathcal{L}_0$, so that $\|P_n - P\|_{\mathcal{L}} = \|P_n - P\|_{\mathcal{L}_0}$. This recentered class satisfies bounded differences with constant $2ML$, as $|\ell(y f(x)) - \ell(y' f(x'))| \le L |y f(x) - y' f(x')| \le 2 L M$, as in the proof of Proposition 4.3.1. Applying Proposition 4.3.1 and then Corollary 4.3.3 gives that $\mathbb{P}(\sup_{f \in \mathcal{F}} |\widehat{L}_n(f) - L(f)| \ge 2 \mathcal{R}_n(\mathcal{L}_0) + t) \le \exp(- \frac{n t^2}{2 M^2 L^2})$ for $t \ge 0$. Then applying the contraction inequality (Theorem 4.3.6) yields $\mathcal{R}_n(\mathcal{L}_0) \le 2 L \mathcal{R}_n(\mathcal{F})$, giving the result.
Example 4.4.10 (Support vector machines and hinge losses): In the support vector machine problem, we receive data $(X_i, Y_i) \in \mathbb{R}^d \times \{\pm 1\}$, and we seek to minimize the average of the losses $\ell(\theta; (x, y)) = \left[ 1 - y \theta^\top x \right]_+$. We assume that the space $\mathcal{X}$ satisfies $\|x\|_2 \le b$ for $x \in \mathcal{X}$ and that $\Theta \subset \{\theta \in \mathbb{R}^d \mid \|\theta\|_2 \le r\}$, so that Proposition 4.4.9 gives
\[
\mathbb{P}\left( \sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge 4 \mathcal{R}_n(\mathcal{F}_\Theta) + t \right)
\le \exp\left( - \frac{n t^2}{2 r^2 b^2} \right),
\]
where $\mathcal{F}_\Theta = \{f_\theta(x) = \theta^\top x\}_{\theta \in \Theta}$. Now, we apply Example 4.4.7, which implies that
\[
\mathcal{R}_n(\phi \circ \mathcal{F}_\Theta) \le 2 \mathcal{R}_n(\mathcal{F}_\Theta) \le \frac{2 r b}{\sqrt{n}}.
\]
Combining the two displays, we obtain
\[
\mathbb{P}\left( \sup_{\theta \in \Theta} |P_n \ell(\theta; (X, Y)) - P \ell(\theta; (X, Y))| \ge \frac{4 r b}{\sqrt{n}} + t \right)
\le \exp\left( - \frac{n t^2}{2 (r b)^2} \right),
\]
so that $P_n$ and $P$ become close at rate roughly $r b / \sqrt{n}$ in this case. 3
When we do not have the simplifying structure of `(yf (x)) identified in the preceding examples,
we can still provide guarantees of generalization using the covering number guarantees introduced
in Section 4.3.2. The most common and important case is when we have a Lipschitzian loss function
in an underlying parameter θ.
Example 4.4.11 (Lipschitz functions over a norm-bounded parameter space): Consider the parametric loss minimization problem
\[
\mathop{\rm minimize}_{\theta \in \Theta} ~ L(\theta) := \mathbb{E}_P[\ell(\theta, Z)]
\]
for a loss function $\ell$ that is $M$-Lipschitz (with respect to the norm $\|\cdot\|$) in its argument $\theta$, where for normalization we assume $\inf_{\theta \in \Theta} \ell(\theta, z) = 0$ for each $z$. Then the metric entropy of $\Theta$ bounds the metric entropy of the loss class $\mathcal{F} := \{z \mapsto \ell(\theta, z)\}_{\theta \in \Theta}$ for the supremum norm $\|\cdot\|_\infty$. Indeed, for any pair $\theta, \theta'$, we have
\[
\sup_z |\ell(\theta, z) - \ell(\theta', z)| \le M \|\theta - \theta'\|.
\]
Assume that $\Theta \subset \{\theta \mid \|\theta\| \le b\}$ for some finite $b$. Then Lemma 4.3.10 guarantees that $\log N(\epsilon, \Theta, \|\cdot\|) \le d \log(1 + 2b/\epsilon) \lesssim d \log \frac{b}{\epsilon}$, and so the classical covering number argument in Example 4.3.9 gives
\[
\mathbb{P}\left( \sup_{\theta \in \Theta} |P_n \ell(\theta, Z) - P \ell(\theta, Z)| \ge t \right)
\le \exp\left( - c \frac{n t^2}{b^2 M^2} + C d \log \frac{M}{t} \right),
\]
where $c, C$ are numerical constants. In particular, taking $t^2 \asymp \frac{M^2 b^2 d}{n} \log \frac{n}{\delta}$ gives that
\[
|P_n \ell(\theta, Z) - P \ell(\theta, Z)| \lesssim \frac{M b \sqrt{d \log \frac{n}{\delta}}}{\sqrt{n}}
\quad \mbox{for all } \theta \in \Theta
\]
with probability at least $1 - \delta$. 3
\[
L^* := \inf_f L(f),
\]
where the preceding infimum is taken across all (measurable) functions. Then we have
error, but may have substantial approximation error. With that in mind, we would like to develop
procedures that, rather than simply attaining good performance for the class F, are guaranteed
to trade-off in an appropriate way between the two types of error. This leads us to the idea of
structural risk minimization.
In this scenario, we assume we have a sequence of classes of functions, F1 , F2 , . . ., of increasing
complexity, meaning that F1 ⊂ F2 ⊂ . . .. For example, in a linear classification setting with
vectors x ∈ Rd , we might take a sequence of classes allowing increasing numbers of non-zeros in
the classification vector θ:
\[
\mathcal{F}_1 := \left\{ f_\theta(x) = \theta^\top x \mbox{ such that } \|\theta\|_0 \le 1 \right\},
\quad
\mathcal{F}_2 := \left\{ f_\theta(x) = \theta^\top x \mbox{ such that } \|\theta\|_0 \le 2 \right\}, \ldots.
\]
More broadly, let {Fk }k∈N be a (possibly infinite) increasing sequence of function classes. We
assume that for each Fk and each n ∈ N, there exists a constant Cn,k (δ) such that we have the
uniform generalization guarantee
\[
\mathbb{P}\left( \sup_{f \in \mathcal{F}_k} \left| \widehat{L}_n(f) - L(f) \right| \ge C_{n,k}(\delta) \right) \le \delta \cdot 2^{-k}.
\]
(We will see in subsequent sections of the course how to obtain other more general guarantees.)
We consider the following structural risk minimization procedure. First, given the empirical risk $\widehat{L}_n$, we find the index $\widehat{k}$ of the model collection minimizing the penalized risk
\[
\widehat{k} := \mathop{\rm argmin}_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \left\{ \widehat{L}_n(f) + C_{n,k}(\delta) \right\}.
\tag{4.4.7a}
\]
We then choose $\widehat{f}$ to minimize the empirical risk over the estimated "best" class $\mathcal{F}_{\widehat{k}}$, that is, we set
\[
\widehat{f} := \mathop{\rm argmin}_{f \in \mathcal{F}_{\widehat{k}}} \widehat{L}_n(f).
\tag{4.4.7b}
\]
Theorem 4.4.12. Let $\widehat{f}$ be chosen according to the procedure (4.4.7a)–(4.4.7b). Then with probability at least $1 - \delta$, we have
\[
L(\widehat{f}\,) \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \left\{ L(f) + 2 C_{n,k}(\delta) \right\}.
\]
On the event that $\sup_{f \in \mathcal{F}_k} |\widehat{L}_n(f) - L(f)| < C_{n,k}(\delta)$ for all $k$, which occurs with probability at least $1 - \delta$, we have
\[
L(\widehat{f}\,) \le \widehat{L}_n(\widehat{f}\,) + C_{n,\widehat{k}}(\delta)
= \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \left\{ \widehat{L}_n(f) + C_{n,k}(\delta) \right\}
\le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \left\{ L(f) + 2 C_{n,k}(\delta) \right\}.
\]
We conclude with a final example, using our earlier floating point bound from Example 4.4.5, coupled with Corollary 4.4.6 and Theorem 4.4.12.

Example 4.4.13 (Structural risk minimization with floating point classifiers): Consider again our floating point example, and let the function class $\mathcal{F}_k$ consist of functions defined by at most $k$ double-precision floating point values, so that $\log |\mathcal{F}_k| \le 45 k$. Then by taking
\[
C_{n,k}(\delta) = \sqrt{\frac{\log \frac{1}{\delta} + 65 k \log 2}{2 n}}
\]
we have that $|\widehat{L}_n(f) - L(f)| \le C_{n,k}(\delta)$ simultaneously for all $f \in \mathcal{F}_k$ and all $\mathcal{F}_k$, with probability at least $1 - \delta$. Then the structural risk minimization procedure (4.4.7) guarantees that
\[
L(\widehat{f}\,) \le \inf_{k \in \mathbb{N}} \inf_{f \in \mathcal{F}_k} \left\{ L(f) + \sqrt{\frac{2 \log \frac{1}{\delta} + 91 k}{n}} \right\}.
\]
Roughly, we trade between small risk $L(f)$—as the risk $\inf_{f \in \mathcal{F}_k} L(f)$ must be decreasing in $k$—and the estimation error penalty, which scales as $\sqrt{(k + \log \frac{1}{\delta})/n}$. 3
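The procedure (4.4.7) is simple to sketch in code. In the sketch below we take nested classes $\mathcal{F}_k$ of linear classifiers supported on the first $k$ coordinates and the penalty $C_{n,k}(\delta)$ of Example 4.4.13; the data-generating model and the crude random search within each class are arbitrary illustrations, not part of the notes.

\begin{verbatim}
# Sketch: structural risk minimization (4.4.7) over nested linear classes.
# Data model, class parameterization, and search strategy are assumptions.
import numpy as np

rng = np.random.default_rng(8)
d, n, delta = 10, 2000, 0.05
theta_star = np.zeros(d)
theta_star[:3] = 1.0                     # true signal uses 3 coordinates
X = rng.normal(size=(n, d))
y = np.sign(X @ theta_star + rng.normal(size=n))

def emp_risk(theta):
    return np.mean((X @ theta) * y <= 0)

def C(n, k, delta):                      # penalty from Example 4.4.13
    return np.sqrt((np.log(1 / delta) + 65 * k * np.log(2)) / (2 * n))

best = None
for k in range(1, d + 1):
    cands = rng.normal(size=(200, d))    # crude random search over F_k
    cands[:, k:] = 0.0
    risks = [emp_risk(t) for t in cands]
    penalized = min(risks) + C(n, k, delta)      # inner part of (4.4.7a)
    if best is None or penalized < best[0]:
        best = (penalized, k, cands[int(np.argmin(risks))])  # (4.4.7b)

print(f"selected class k = {best[1]}, penalized empirical risk = {best[0]:.3f}")
\end{verbatim}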
where for the last inequality we made the substitution $u = t^2/\sigma^2$. Noting that this final integral is $\Gamma(k/2)$, we have $\mathbb{E}[|X|^k] \le k \sigma^k \Gamma(k/2)$. Because $\Gamma(s) \le s^s$ for $s \ge 1$, we obtain
\[
\mathbb{E}[|X|^k]^{1/k} \le k^{1/k} \sigma \sqrt{k/2} \le e^{1/e} \sigma \sqrt{k}.
\]
Thus (2) holds with $K_2 = e^{1/e}$.
(2) implies (3) Let $\sigma = \|X\|_{\psi_2} = \sup_{k \ge 1} k^{-\frac{1}{2}} \mathbb{E}[|X|^k]^{1/k}$, so that $K_2 = 1$ and $\mathbb{E}[|X|^k] \le k^{\frac{k}{2}} \sigma^k$ for all $k$. For $K_3 \in \mathbb{R}_+$, we thus have
\[
\mathbb{E}[\exp(X^2/(K_3^2 \sigma^2))]
= \sum_{k=0}^\infty \frac{\mathbb{E}[X^{2k}]}{k! K_3^{2k} \sigma^{2k}}
\le \sum_{k=0}^\infty \frac{\sigma^{2k} (2k)^k}{k! K_3^{2k} \sigma^{2k}}
\stackrel{(i)}{\le} \sum_{k=0}^\infty \left( \frac{2 e}{K_3^2} \right)^k,
\]
where inequality (i) follows because $k! \ge (k/e)^k$, or $1/k! \le (e/k)^k$. Noting that $\sum_{k=0}^\infty \alpha^k = \frac{1}{1-\alpha}$, we obtain (3) by taking $K_3 = e \sqrt{2/(e-1)} \approx 2.933$.
(3) implies (4) Let us take $K_3 = 1$. We claim that (4) holds with $K_4 = 3/4$. We prove this result for both small and large $\lambda$. First, note the (highly non-standard, but true!) inequality that $e^x \le x + e^{9 x^2/16}$ for all $x$. Then we have
\[
\mathbb{E}[\exp(\lambda X)] \le \underbrace{\mathbb{E}[\lambda X]}_{= 0} + \mathbb{E}\left[ \exp\left( \frac{9 \lambda^2 X^2}{16} \right) \right].
\]
Now note that for $|\lambda| \le \frac{4}{3 \sigma}$, we have $9 \lambda^2 \sigma^2/16 \le 1$, and so by Jensen's inequality,
\[
\mathbb{E}\left[ \exp\left( \frac{9 \lambda^2 X^2}{16} \right) \right]
= \mathbb{E}\left[ \exp(X^2/\sigma^2)^{\frac{9 \lambda^2 \sigma^2}{16}} \right]
\le e^{\frac{9 \lambda^2 \sigma^2}{16}}.
\]
For large $\lambda$, we use the simpler Fenchel-Young inequality, that is, that $\lambda x \le \frac{\lambda^2 \sigma^2}{2 c} + \frac{c x^2}{2 \sigma^2}$, valid for all $c \ge 0$. Then we have for any $0 \le c \le 2$ that
\[
\mathbb{E}[\exp(\lambda X)] \le e^{\frac{\lambda^2 \sigma^2}{2 c}} \mathbb{E}\left[ \exp\left( \frac{c X^2}{2 \sigma^2} \right) \right]
\le e^{\frac{\lambda^2 \sigma^2}{2 c}} e^{\frac{c}{2}},
\]
where the final inequality follows from Jensen's inequality. If $|\lambda| \ge \frac{4}{3 \sigma}$, then $\frac{1}{2} \le \frac{9}{32} \lambda^2 \sigma^2$, and we have
\[
\mathbb{E}[\exp(\lambda X)] \le \inf_{c \in [0, 2]} e^{\left[ \frac{1}{2 c} + \frac{9 c}{32} \right] \lambda^2 \sigma^2}
= \exp\left( \frac{3 \lambda^2 \sigma^2}{4} \right).
\]
(4) implies (1) This is the content of Proposition 4.1.8, with $K_4 = \frac{1}{2}$ and $K_1 = 2$.
where we used the substitution u = t/σ. Thus we have E[|X|k ] ≤ 2Γ(k + 1)σ k , and using Γ(k + 1) ≤
k k yields E[|X|k ]1/k ≤ 21/k kσ, so that (2) holds with K2 ≤ 2.
where inequality (i) used that k! ≥ (k/e)k . Taking K3 = e2 /(e − 1) < 5 gives the result.
(2) if and only if (4) Thus, we see that up to constant numerical factors, the definition $\|X\|_{\psi_1} = \sup_{k \ge 1} k^{-1} \mathbb{E}[|X|^k]^{1/k}$ has the equivalent statements.

Now, let us assume that (2) holds with $K_2 = 1$, so that $\sigma = \|X\|_{\psi_1}$, and that $\mathbb{E}[X] = 0$. Then we have $\mathbb{E}[X^k] \le k^k \|X\|_{\psi_1}^k$, and
\[
\mathbb{E}[\exp(\lambda X)] = 1 + \sum_{k=2}^\infty \frac{\lambda^k \mathbb{E}[X^k]}{k!}
\le 1 + \sum_{k=2}^\infty \lambda^k \|X\|_{\psi_1}^k \cdot \frac{k^k}{k!}
\le 1 + \sum_{k=2}^\infty \lambda^k \|X\|_{\psi_1}^k e^k,
\]
the final inequality following because $k! \ge (k/e)^k$. Now, if $|\lambda| \le \frac{1}{2 e \|X\|_{\psi_1}}$, then we have
\[
\mathbb{E}[\exp(\lambda X)] \le 1 + \lambda^2 e^2 \|X\|_{\psi_1}^2 \sum_{k=0}^\infty (\lambda \|X\|_{\psi_1} e)^k
\le 1 + 2 e^2 \|X\|_{\psi_1}^2 \lambda^2,
\]
as the final sum is at most $\sum_{k=0}^\infty 2^{-k} = 2$. Using $1 + x \le e^x$ gives that (2) implies (4). For the opposite direction, we may simply use that if (4) holds with $K_4 = 1$ and $K_4' = 1$, then $\mathbb{E}[\exp(X/\sigma)] \le \exp(1)$, so that (3) holds.
4.6 Bibliography
A few references on concentration, random matrices, and entropies include Vershynin’s extraordi-
narily readable lecture notes [170], upon which our proof of Theorem 4.1.11 is based, the compre-
hensive book of Boucheron, Lugosi, and Massart [34], and the more advanced material in Buldygin
and Kozachenko [41]. Many of our arguments are based on those of Vershynin and Boucheron et al. Kolmogorov and Tikhomirov [121] introduced metric entropy.
4.7 Exercises
Exercise 4.1 (Concentration of bounded random variables): Let X be a random variable taking
values in [a, b], where −∞ < a ≤ b < ∞. In this question, we show Hoeffding’s Lemma, that is,
that $X$ is sub-Gaussian: for all $\lambda \in \mathbb{R}$, we have
\[
\mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \exp\left( \frac{\lambda^2 (b - a)^2}{8} \right).
\]
(a) Show that ${\rm Var}(X) \le \left( \frac{b-a}{2} \right)^2 = \frac{(b-a)^2}{4}$ for any random variable $X$ taking values in $[a, b]$.

(b) Let
\[
\varphi(\lambda) = \log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))].
\]
Assuming that E[X] = 0 (convince yourself that this is no loss of generality) show that
(You may assume that derivatives and expectations commute, which they do in this case.)
(c) Construct a random variable Yt , defined for t ∈ R, such that Yt ∈ [a, b] and
Exercise 4.2: In this question, we show how to use Bernstein-type (sub-exponential) inequal-
ities to give sharp convergence guarantees. Recall (Example 4.1.14, Corollary 4.1.18, and inequal-
ity (4.1.6)) that if Xi are independent bounded random variables with |Xi − E[X]| ≤ b for all i and
${\rm Var}(X_i) \le \sigma^2$, then
\[
\max\left\{ \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n X_i \ge \mathbb{E}[X] + t \right),
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n X_i \le \mathbb{E}[X] - t \right) \right\}
\le \exp\left( - \frac{1}{2} \min\left\{ \frac{5 n t^2}{6 \sigma^2}, \frac{n t}{2 b} \right\} \right).
\]
We consider minimization of loss functions $\ell$ over finite function classes $\mathcal{F}$ with $\ell \in [0, 1]$, so that if $L(f) = \mathbb{E}[\ell(f, Z)]$ then $|\ell(f, Z) - L(f)| \le 1$. Throughout this question, we let $L^\star := \min_{f \in \mathcal{F}} L(f)$, let $f^\star \in \mathop{\rm argmin}_{f \in \mathcal{F}} L(f)$, and let $\widehat{f}_n \in \mathop{\rm argmin}_{f \in \mathcal{F}} \widehat{L}_n(f)$ denote the empirical risk minimizer.
We will show that, roughly, a procedure based on picking an empirical risk minimizer is unlikely to
choose a function f ∈ F with bad performance, so that we obtain faster concentration guarantees.
(b) Define the set of "bad" prediction functions $\mathcal{F}^{\rm bad}_\epsilon := \{f \in \mathcal{F} : L(f) \ge L^\star + \epsilon\}$. Show that for any fixed $\epsilon \ge 0$ and any $f \in \mathcal{F}^{\rm bad}_{2\epsilon}$, we have
\[
\mathbb{P}\left( \widehat{L}_n(f) \le L^\star + \epsilon \right)
\le \exp\left( - \frac{1}{2} \min\left\{ \frac{5 n \epsilon^2}{6 \left( L^\star (1 - L^\star) + \epsilon(1 - \epsilon) \right)}, \frac{n \epsilon}{2} \right\} \right).
\]
(c) Conclude that
\[
\mathbb{P}\left( L(\widehat{f}_n) \ge L(f^\star) + 2 \epsilon \right)
\le {\rm card}(\mathcal{F}) \cdot \exp\left( - \frac{1}{2} \min\left\{ \frac{5 n \epsilon^2}{6 \left( L^\star (1 - L^\star) + \epsilon(1 - \epsilon) \right)}, \frac{n \epsilon}{2} \right\} \right).
\]
(d) Using the result of part (c), argue that with probability at least $1 - \delta$,
\[
L(\widehat{f}_n) \le L(f^\star) + \frac{4 \log \frac{|\mathcal{F}|}{\delta}}{n}
+ \frac{12}{5} \cdot \frac{\sqrt{L^\star(1 - L^\star) \log \frac{|\mathcal{F}|}{\delta}}}{\sqrt{n}}.
\]
Why is this better than an inequality based purely on the boundedness of the loss $\ell$, such as Theorem 4.4.4 or Corollary 4.4.6? What happens when there is a perfect risk minimizer $f^\star$?
Exercise 4.3 (Likelihood ratio bounds and concentration): Consider a data release problem,
where given a sample x, we release a sequence of data Z1 , Z2 , . . . , Zn belonging to a discrete set Z,
where Zi may depend on Z1i−1 and x. We assume that the data has limited information about x
in the sense that for any two samples $x, x'$, we have the likelihood ratio bound
\[
\frac{p(z_i \mid x, z_1^{i-1})}{p(z_i \mid x', z_1^{i-1})} \le e^\varepsilon.
\]
Let us control the amount of "information" (in the form of an updated log-likelihood ratio) released by this sequential mechanism. Fix $x, x'$, and define
\[
L(z_1, \ldots, z_n) := \log \frac{p(z_1, \ldots, z_n \mid x)}{p(z_1, \ldots, z_n \mid x')}.
\]
(a) Show that
\[
\mathbb{P}\left( L(Z_1, \ldots, Z_n) \ge n \varepsilon (e^\varepsilon - 1) + t \right)
\le \exp\left( - \frac{t^2}{2 n \varepsilon^2} \right).
\]
(b) Let γ ∈ (0, 1). Give the largest value of ε you can that is sufficient to guarantee that for any
test Ψ : Z n → {x, x0 }, we have
where Px and Px0 denote the sampling distribution of Z1n under x and x0 , respectively?
where Cp is a constant (that depends on p). As a corollary, derive that if E[|Xi |p ] ≤ σ p and p ≥ 2,
then
\[
\mathbb{E}\left[ \bigg| \frac{1}{n} \sum_{i=1}^n X_i \bigg|^p \right] \le C_p \frac{\sigma^p}{n^{p/2}}.
\]
That is, sample means converge quickly to zero in higher moments. Hint: For any fixed x ∈ Rn , if
εi are i.i.d. uniform signs εi ∈ {±1}, then εT x is sub-Gaussian.
Exercise 4.5 (Small balls and anti-concentration): Let $X$ be a nonnegative random variable satisfying $\mathbb{P}(X \le \epsilon) \le c \epsilon$ for some $c < \infty$ and all $\epsilon > 0$. Argue that if $X_i$ are i.i.d. copies of $X$, then
\[
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n X_i \ge t \right) \ge 1 - \exp\left( - 2 n \left[ 1/2 - 2 c t \right]_+^2 \right)
\]
for all $t$.
Exercise 4.6 (Lipschitz functions remain sub-Gaussian): Let X be σ 2 -sub-Gaussian and f :
R → R be L-Lipschitz, meaning that |f (x) − f (y)| ≤ L|x − y| for all x, y. Prove that there exists a
numerical constant C < ∞ such that f (X) is CL2 σ 2 -sub-Gaussian.
Exercise 4.7 (Sub-gaussian maxima): Let X1 , . . . , Xn be σ 2 -sub-gaussian (not necessarily inde-
pendent) random variables. Show that
(a) $\mathbb{E}[\max_i X_i] \le \sqrt{2 \sigma^2 \log n}$.

(b) There exists a numerical constant $C < \infty$ such that $\mathbb{E}[\max_i |X_i|^p] \le (C p \sigma^2 \log n)^{p/2}$.
Exercise 4.8: Consider a binary classification problem with logistic loss `(θ; (x, y)) = log(1 +
exp(−yθT x)), where θ ∈ Θ := {θ ∈ Rd | kθk1 ≤ r} and y ∈ {±1}. Assume additionally that the
space $\mathcal{X} \subset \{x \in \mathbb{R}^d \mid \|x\|_\infty \le b\}$. Define the empirical and population risks $\widehat{L}_n(\theta) := P_n \ell(\theta; (X, Y))$ and $L(\theta) := P \ell(\theta; (X, Y))$, and let $\widehat{\theta}_n = \mathop{\rm argmin}_{\theta \in \Theta} \widehat{L}_n(\theta)$. Show that with probability at least $1 - \delta$ over $(X_i, Y_i) \stackrel{\rm iid}{\sim} P$,
\[
L(\widehat{\theta}_n) \le \inf_{\theta \in \Theta} L(\theta) + C \frac{r b \sqrt{\log \frac{d}{\delta}}}{\sqrt{n}},
\]
where $C < \infty$ is a numerical constant (you need not specify this).
Exercise 4.9 (Sub-Gaussian constants of Bernoulli random variables): In this exercise, we will
derive sharp sub-Gaussian constants for Bernoulli random variables (cf. [106, Thm. 1] or [118, 24]),
showing
\[
\log \mathbb{E}[e^{t(X - p)}] \le \frac{1 - 2p}{4 \log \frac{1-p}{p}} \, t^2
\quad \mbox{for all } t \ge 0.
\tag{4.7.1}
\]
where $Y_t = (1 - p)$ with probability $q(t) := \frac{p e^{t(1-p)}}{p e^{t(1-p)} + (1-p) e^{-tp}}$ and $Y_t = -p$ otherwise.
\[
f(s) = \frac{1 - 2p}{2} C s^2 + C p s - \log(1 - p + p e^{Cs}),
\]
so that inequality (4.7.1) holds if and only if $f(s) \ge 0$ for all $s \ge 0$. Give $f'(s)$ and $f''(s)$.
(e) Show that $f(0) = f(1) = f'(0) = f'(1) = 0$, and argue that $f''(s)$ changes signs at most twice and that $f''(0) = f''(1) > 0$. Use this to show that $f(s) \ge 0$ for all $s \ge 0$.
JCD Comment: Perhaps use transportation inequalities to prove this bound, and
also maybe give Ordentlich and Weinberger’s “A Distribution Dependent Refinement
of Pinsker’s Inequality” as an exercise.
Exercise 4.10: Let $s(p) = \frac{1 - 2p}{\log \frac{1-p}{p}}$. Show that $s$ is concave on $[0, 1]$.
1. A hypothesis test likelihood ratio for them (see page 40 of handwritten notes)
2. A full learning guarantee with convergence of Hessian and everything, e.g., for logistic
regression?
3. In the Ledoux-Talagrand stuff, maybe worth going through example of logistic regres-
sion. Also, having working logistic example throughout? Helps clear up the structure
and connect with exponential families.
Chapter 5
Concentration inequalities provide powerful techniques for demonstrating when random objects
that are functions of collections of independent random variables—whether sample means, functions
with bounded variation, or collections of random vectors—behave similarly to their expectations.
This chapter continues exploration of these ideas by incorporating the central thesis of this book:
that information theory’s connections to statistics center around measuring when (and how) two
probability distributions get close to one another. On its face, we remain focused on the main
objects of the preceding chapter, where we have a population probability distribution P on a space
X and some collection of functions f : X → R. We then wish to understand when we expect the
empirical distribution
n
1X
Pn := 1Xi ,
n
i=1
iid
defined by teh sample Xi ∼ P , to be close to the population P as measured by f . Following the
notation we introduce in Section 4.3, for P f := EP [f (X)], we again ask to have
n
1X
Pn f − P f = f (Xi ) − EP [f (X)]
n
i=1
where the supremum is taken over measurable functions g : X → R with EQ [eg(X) ] < ∞.
We give one proof of this result and one sketch of a proof, which holds when the underlying space
is discrete, that may be more intuitive: the first constructs a particular “tilting” of Q via the
function eg , and verifies the equality. The second relies on the discretization of the KL-divergence
and may be more intuitive to readers familiar with convex optimization: essentially, we expect this
result because the function $\log(\sum_{j=1}^k e^{x_j})$ is the convex conjugate of the negative entropy. (See also Exercise 5.1.)
Proof We may assume that P is absolutely continuous with respect to Q, meaning that Q(A) = 0
implies that P (A) = 0, as otherwise both sides are infinite by inspection. Thus, it is no loss of
generality to let P and Q have densities p and q.
Attainment in the equality is easy: we simply take $g(x) = \log \frac{p(x)}{q(x)}$, so that $\mathbb{E}_Q[e^{g(X)}] = 1$. To show that the right hand side is never larger than $D_{\rm kl}(P \| Q)$ requires a bit more work. To that end, let $g$ be any function such that $\mathbb{E}_Q[e^{g(X)}] < \infty$, and define the random variable $Z_g(x) = e^{g(x)} / \mathbb{E}_Q[e^{g(X)}]$, so that $\mathbb{E}_Q[Z_g] = 1$. Then using the absolute continuity of $P$ w.r.t. $Q$, we have
\begin{align*}
\mathbb{E}_P[\log Z_g]
& = \mathbb{E}_P\left[ \log \frac{p(X)}{q(X)} + \log \frac{q(X)}{p(X)} Z_g(X) \right]
= D_{\rm kl}(P \| Q) + \mathbb{E}_P\left[ \log \frac{dQ}{dP} Z_g \right] \\
& \le D_{\rm kl}(P \| Q) + \log \mathbb{E}_P\left[ \frac{dQ}{dP} Z_g \right]
= D_{\rm kl}(P \| Q) + \log \mathbb{E}_Q[Z_g].
\end{align*}
As $\mathbb{E}_Q[Z_g] = 1$, using that $\mathbb{E}_P[\log Z_g] = \mathbb{E}_P[g(X)] - \log \mathbb{E}_Q[e^{g(X)}]$ gives the result.
Here is the second proof of Theorem 5.1.1, which applies when X is discrete and finite. That we
can approximate KL-divergence by suprema over finite partitions (as in definition (2.2.1)) suggests
that this approach works in general—which it can—but this requires some not completely trivial
approximations of EP [g] and EQ [eg ] by discretized versions of their expectations, which makes
things rather tedious.
Proof of Theorem 5.1.1, the finite case We have assumed that $P$ and $Q$ have finite supports, which we identify with $\{1, \ldots, k\}$, with p.m.f.s $p, q \in \Delta_k = \{p \in \mathbb{R}_+^k \mid \langle 1, p \rangle = 1\}$. Define $f_q(v) = \log(\sum_{j=1}^k q_j e^{v_j})$, which is convex in $v$ (recall Proposition 3.2.1). Then the supremum in the variational representation takes the form
\[
h(p) := \sup_{v \in \mathbb{R}^k} \left\{ \langle p, v \rangle - f_q(v) \right\}.
\]
If we can take derivatives and solve for zero, we are guaranteed to achieve the supremum. To that end, note that
\[
\nabla_v \left\{ \langle p, v \rangle - f_q(v) \right\}
= p - \left[ \frac{q_i e^{v_i}}{\sum_{j=1}^k q_j e^{v_j}} \right]_{i=1}^k,
\]
so that setting $v_j = \log \frac{p_j}{q_j}$ achieves $p - \nabla_v f_q(v) = p - p = 0$ and hence the supremum. Noting that $\log(\sum_{j=1}^k q_j \exp(\log \frac{p_j}{q_j})) = \log(\sum_{j=1}^k p_j) = 0$ gives $h(p) = D_{\rm kl}(p \| q)$.
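The finite-dimensional calculation is easy to verify numerically (a sketch; the dimension and the random p.m.f.s are arbitrary): the choice $v = \log(p/q)$ attains $\langle p, v \rangle - f_q(v) = D_{\rm kl}(p \| q)$, while other choices of $v$ fall below it.

\begin{verbatim}
# Sketch: numerical check of the Donsker-Varadhan formula in the finite case.
import numpy as np

rng = np.random.default_rng(9)
k = 6
p = rng.random(k); p /= p.sum()
q = rng.random(k); q /= q.sum()

def objective(v):                     # <p, v> - log sum_j q_j exp(v_j)
    return np.dot(p, v) - np.log(np.dot(q, np.exp(v)))

kl = np.dot(p, np.log(p / q))
print("KL(p||q)                 :", kl)
print("objective at v = log(p/q):", objective(np.log(p / q)))
print("best of 1000 random v    :",
      max(objective(rng.normal(size=k)) for _ in range(1000)))
\end{verbatim}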
The Donsker-Varadhan variational representation already gives a hint that we can use some information-theoretic techniques to control the difference between an empirical sample and its expectation, at least in an average sense. In particular, we see that for any function $g$ and any random variable $X$ with laws $P$ and $Q$ as above, we have
\[
\mathbb{E}_P[g(X)] \le D_{\rm kl}(P \| Q) + \log \mathbb{E}_Q[e^{g(X)}].
\]
Now, turning this on its head a bit, suppose that we consider a collection of functions $\mathcal{F}$ and put two probability measures $\pi$ and $\pi_0$ on $\mathcal{F}$, and consider $P_n f - P f$, where we treat $f$ as a random variable $f \sim \pi$ or $f \sim \pi_0$. Then a consequence of the Donsker-Varadhan theorem is that
\[
\int (P_n f - P f) \, d\pi(f) \le D_{\rm kl}(\pi \| \pi_0) + \log \int \exp(P_n f - P f) \, d\pi_0(f)
\]
for any $\pi, \pi_0$. While this inequality is a bit naive—bounding a difference by an exponent seems wasteful—as we shall see, it has substantial applications when we can upper bound the KL-divergence $D_{\rm kl}(\pi \| \pi_0)$.
Proof The key is to combine Example 4.1.12 with the variational representation that Theorem 5.1.1 provides for KL-divergences. We state Example 4.1.12 as a lemma here.

Without loss of generality, we assume that $P f = 0$ for all $f \in \mathcal{F}$, and recall that $P_n f = \frac{1}{n} \sum_{i=1}^n f(X_i)$ is the empirical mean of $f$. Then we know that $P_n f$ is $\sigma^2/n$-sub-Gaussian, and Lemma 5.2.2 implies that $\mathbb{E}[\exp(\lambda (P_n f)^2)] \le \left( 1 - 2 \lambda \sigma^2/n \right)_+^{-1/2}$ for any $f$, and thus for any prior $\pi_0$ on $f$ we have
\[
\mathbb{E}\left[ \int \exp(\lambda (P_n f)^2) \, d\pi_0(f) \right] \le \left( 1 - 2 \lambda \sigma^2/n \right)_+^{-1/2}.
\]
Consequently, taking $\lambda = \lambda_n := \frac{3 n}{8 \sigma^2}$, we obtain
\[
\mathbb{E}\left[ \int \exp(\lambda_n (P_n f)^2) \, d\pi_0(f) \right]
= \mathbb{E}\left[ \int \exp\left( \frac{3 n}{8 \sigma^2} (P_n f)^2 \right) d\pi_0(f) \right] \le 2.
\]

By Jensen's inequality (or Cauchy-Schwarz), it is immediate from Theorem 5.2.1 that we also have
\[
\int |P_n f - P f| \, d\pi(f) \le \sqrt{\frac{8 \sigma^2}{3} \cdot \frac{D_{\rm kl}(\pi \| \pi_0) + \log \frac{2}{\delta}}{n}}
\quad \mbox{simultaneously for all } \pi \in \Pi
\tag{5.2.2}
\]
with probability at least $1 - \delta$, so that $\mathbb{E}_\pi[|P_n f - P f|]$ is with high probability of order $1/\sqrt{n}$. The inequality (5.2.2) is the original form of the PAC-Bayes bound due to McAllester, with slightly
sharper constants and improved logarithmic dependence. The key is that stability, in the form of closeness of a prior $\pi_0$ and posterior $\pi$, allows us to achieve reasonably tight control over the deviations of random variables and functions with high probability.
Let us give an example, which is similar to many of our approaches in Section 4.4, to illustrate
some of the approaches this allows. The basic idea is that by appropriate choice of prior π0
and “posterior” π, whenever we have appropriately smooth classes of functions we achieve certain
generalization guarantees.
Example 5.2.3 (A uniform law for Lipschitz functions): Consider a case as in Section 4.4,
where we let L(θ) = P `(θ, Z) for some function ` : Θ × Z → R. Let Bd2 = {v ∈ Rd | kvk2 ≤ 1}
be the `2 -ball in Rd , and let us assume that Θ ⊂ rBd2 and additionally that θ 7→ `(θ, z) is
M -Lipschitz for all z ∈ Z. For simplicity, we assume that `(θ, z) ∈ [0, 2M r] for all θ ∈ Θ (we
may simply relativize our bounds by replacing ` by `(·, z) − inf θ∈Θ `(θ, z) ∈ [0, 2M r]).
If $\widehat{L}_n(\theta) = P_n \ell(\theta, Z)$, then Theorem 5.2.1 implies that
\[
\int |\widehat{L}_n(\theta) - L(\theta)| \, d\pi(\theta)
\le \sqrt{\frac{8 M^2 r^2}{3 n} \left( D_{\rm kl}(\pi \| \pi_0) + \log \frac{2}{\delta} \right)}
\]
for all $\pi$ with probability at least $1 - \delta$. Now, let $\theta_0 \in \Theta$ be arbitrary, and for $\epsilon > 0$ (to be chosen later) take $\pi_0$ to be uniform on $(r + \epsilon) \mathbb{B}_2^d$ and $\pi$ to be uniform on $\theta_0 + \epsilon \mathbb{B}_2^d$. Then we immediately see that $D_{\rm kl}(\pi \| \pi_0) = d \log(1 + \frac{r}{\epsilon})$. Moreover, we have $\int \widehat{L}_n(\theta) d\pi(\theta) \in \widehat{L}_n(\theta_0) \pm M \epsilon$ and similarly for $L(\theta)$, by the $M$-Lipschitz continuity of $\ell$. For any fixed $\epsilon > 0$, we thus have
\[
|\widehat{L}_n(\theta_0) - L(\theta_0)| \le 2 M \epsilon
+ \sqrt{\frac{8 M^2 r^2}{3 n} \left( d \log\left( 1 + \frac{r}{\epsilon} \right) + \log \frac{2}{\delta} \right)}
\]
simultaneously for all $\theta_0 \in \Theta$, with probability at least $1 - \delta$. By choosing $\epsilon = \frac{r d}{n}$ we obtain that with probability at least $1 - \delta$,
\[
\sup_{\theta \in \Theta} |\widehat{L}_n(\theta) - L(\theta)| \le \frac{2 M r d}{n}
+ \sqrt{\frac{8 M^2 r^2}{3 n} \left( d \log\left( 1 + \frac{n}{d} \right) + \log \frac{2}{\delta} \right)}.
\]
Thus, roughly, with high probability we have $|\widehat{L}_n(\theta) - L(\theta)| \le O(1) M r \sqrt{\frac{d}{n} \log \frac{n}{d}}$ for all $\theta$. 3
On the one hand, the result in Example 5.2.3 is satisfying: it applies to any Lipschitz function
and provides a uniform bound. On the other hand, when we compare to the results achievable for specially structured linear function classes via Rademacher complexity bounds—such as Proposition 4.4.9 and Example 4.4.10—the guarantee here is somewhat weaker, in that it depends on the dimension explicitly, while the Rademacher bounds do not exhibit this explicit dependence. This means the Rademacher bounds can potentially apply in infinite dimensional spaces where Example 5.2.3 cannot. We will give an example presently showing how to address some of these issues.
variable $X$ satisfies $|X| \le b$ but ${\rm Var}(X) \le \sigma^2 \ll b^2$, then $X$ concentrates more quickly about its mean than the convergence provided by naive application of sub-Gaussian concentration with sub-Gaussian parameter $b^2/8$. To that end, we investigate an alternative to Theorem 5.2.1 that
allows somewhat sharper control.
The approach is similar to our derivation in Theorem 5.2.1, where we show that the moment
generating function of a quantity like Pn f − P f is small (Eq. (5.2.1)) and then relate this—via the
Donsker-Varadhan change of measure in Theorem 5.1.1—to the quantities we wish to control. In
the next proposition, we provide relative bounds on the deviations of functions from their means.
To make this precise, let F be a collection of functions f : X → R, and let σ 2 (f ) := Var(f (X)) be
the variance of functions in F. We assume the class satisfies the Bernstein condition (4.1.7) with
parameter b, that is,
\[
\left| \mathbb{E}\left[ (f(X) - P f)^k \right] \right| \le \frac{k!}{2} \sigma^2(f) \, b^{k-2}
\quad \mbox{for } k = 3, 4, \ldots.
\tag{5.2.3}
\]
This says that the second moment of functions $f \in \mathcal{F}$ bounds—with the additional boundedness-type constant $b$—the higher moments of the functions in $\mathcal{F}$. We then have the following result.
Proof We begin with an inequality on the moment generating function of random variables satisfying the Bernstein condition (4.1.7), that is, that $|\mathbb{E}[(X - \mu)^k]| \le \frac{k!}{2} \sigma^2 b^{k-2}$ for $k \ge 2$. In this case, Lemma 4.1.19 implies that
\[
\mathbb{E}[e^{\lambda (X - \mu)}] \le \exp(\lambda^2 \sigma^2)
\quad \mbox{for } |\lambda| \le 1/(2b).
\]
As a consequence, for any $f$ in our collection $\mathcal{F}$, we see that if we define
\[
\Delta_n(f, \lambda) := \lambda \left( P_n f - P f - \lambda \sigma^2(f) \right),
\]
we have that
\[
\mathbb{E}[\exp(n \Delta_n(f, \lambda))]
= \mathbb{E}\left[ \exp\left( \lambda (f(X) - P f) - \lambda^2 \sigma^2(f) \right) \right]^n \le 1
\]
for all $n$, all $f \in \mathcal{F}$, and $|\lambda| \le \frac{1}{2b}$. Then, for any fixed measure $\pi_0$ on $\mathcal{F}$, Markov's inequality implies that
\[
\mathbb{P}\left( \int \exp(n \Delta_n(f, \lambda)) \, d\pi_0(f) \ge \frac{1}{\delta} \right) \le \delta.
\tag{5.2.4}
\]
Now, as in the proof of Theorem 5.2.1, we use the Donsker-Varadhan Theorem 5.1.1 (change of measure), which implies that
\[
n \int \Delta_n(f, \lambda) \, d\pi(f) \le D_{\rm kl}(\pi \| \pi_0) + \log \int \exp(n \Delta_n(f, \lambda)) \, d\pi_0(f)
\]
for all distributions $\pi$. Using inequality (5.2.4), we obtain that with probability at least $1 - \delta$,
\[
\int \Delta_n(f, \lambda) \, d\pi(f) \le \frac{1}{n} \left( D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta} \right)
\]
for all $\pi$. As this holds for any fixed $|\lambda| \le 1/(2b)$, this gives the desired result by rearranging.

We would like to optimize over the bound in Proposition 5.2.4 by choosing the "best" $\lambda$. If we could choose the optimal $\lambda$, by rearranging Proposition 5.2.4 we would obtain the bound
\begin{align*}
\mathbb{E}_\pi[P f]
& \le \mathbb{E}_\pi[P_n f] + \inf_{\lambda > 0} \left\{ \lambda \mathbb{E}_\pi[\sigma^2(f)] + \frac{1}{n \lambda} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta} \right] \right\} \\
& = \mathbb{E}_\pi[P_n f] + 2 \sqrt{\frac{\mathbb{E}_\pi[\sigma^2(f)]}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta} \right]}
\end{align*}
simultaneously for all π, with probability at least 1−δ. The problem with this approach is two-fold:
first, we cannot arbitrarily choose λ in Proposition 5.2.4, and second, the bound above depends on
the unknown population variance σ 2 (f ). It is thus of interest to understand situations in which
we can obtain similar guarantees, but where we can replace unknown population quantities on the
right side of the bound with known quantities.
To that end, let us consider the following condition, a type of relative error condition related
to the Bernstein condition (4.1.7): for each $f \in \mathcal{F}$,
\[
\sigma^2(f) \le b \, P f.
\tag{5.2.5}
\]
This condition is most natural when each of the functions f take nonnegative values—for example,
when f (X) = `(θ, X) for some loss function ` and parameter θ of a model. If the functions f are
nonnegative and upper bounded by b, then we certainly have σ 2 (f ) ≤ E[f (X)2 ] ≤ bE[f (X)] = bP f ,
so that Condition (5.2.5) holds. Revisiting Proposition 5.2.4, we rearrange to obtain the following
theorem.
Theorem 5.2.5. Let $\mathcal{F}$ be a collection of functions satisfying the Bernstein condition (5.2.3) as in Proposition 5.2.4, and in addition, assume the variance-bounding condition (5.2.5). Then for any $0 \le \lambda \le \frac{1}{2b}$, with probability at least $1 - \delta$,
\[
\mathbb{E}_\pi[P f] \le \mathbb{E}_\pi[P_n f] + \frac{\lambda b}{1 - \lambda b} \mathbb{E}_\pi[P_n f]
+ \frac{1}{\lambda (1 - \lambda b)} \cdot \frac{1}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta} \right]
\quad \mbox{for all } \pi.
\]
Proof Use condition (5.2.5) to bound $\sigma^2(f) \le b \, P f$, apply Proposition 5.2.4, and divide both sides of the resulting inequality by $\lambda (1 - \lambda b)$.

To make this uniform in $\lambda$, thus achieving a tighter bound (so that we need not pre-select $\lambda$), we choose multiple values of $\lambda$ and apply a union bound. To that end, let $1 + \eta = \frac{1}{1 - \lambda b}$, so that $\eta = \frac{\lambda b}{1 - \lambda b}$ and $\frac{1}{\lambda b (1 - \lambda b)} = \frac{(1 + \eta)^2}{\eta}$, and the inequality in Theorem 5.2.5 is equivalent to
\[
\mathbb{E}_\pi[P f] \le \mathbb{E}_\pi[P_n f] + \eta \, \mathbb{E}_\pi[P_n f]
+ \frac{(1 + \eta)^2}{\eta} \cdot \frac{b}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{1}{\delta} \right].
\]
Corollary 5.2.6. Let the conditions of Theorem 5.2.5 hold. Then with probability at least $1 - \delta$,
\[
\mathbb{E}_\pi[P f] \le \mathbb{E}_\pi[P_n f]
+ 2 \sqrt{\frac{b \, \mathbb{E}_\pi[P_n f]}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta} \right]}
+ \frac{1}{n} \mathbb{E}_\pi[P_n f]
+ 5 \frac{b}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta} \right],
\]
simultaneously for all $\pi$ on $\mathcal{F}$.

Proof Applying Theorem 5.2.5 with $\delta/n$ in place of $\delta$ and taking a union bound, the bound in Theorem 5.2.5 holds simultaneously for each $\eta \in \{\frac{1}{n}, \frac{2}{n}, \ldots, 1\}$. We consider two cases. In the first, assume that $\mathbb{E}_\pi[P_n f] \le \frac{b}{n} (D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta})$. Then taking $\eta = 1$ above evidently gives the result. In the second, we have $\mathbb{E}_\pi[P_n f] > \frac{b}{n} (D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta})$, and we can set
\[
\eta_\star = \sqrt{\frac{\frac{b}{n} \left( D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta} \right)}{\mathbb{E}_\pi[P_n f]}} \in (0, 1).
\]
Choosing $\eta$ to be the smallest value $\eta_k$ in $\{\eta_1, \ldots, \eta_n\}$ with $\eta_k \ge \eta_\star$, so that $\eta_\star \le \eta \le \eta_\star + \frac{1}{n}$, then implies the claim in the corollary.
We call the quantity hθ, xiy the margin of θ on the pair (x, y), noting that when the margin is
large, hθ, xi has the same sign as y and is “confident” (i.e. far from zero). For shorthand, let us
define the expected and empirical losses at margin $\gamma$ by
\[
L_\gamma(\theta) := P\left( \langle \theta, X \rangle Y \le \gamma \right)
\quad \mbox{and} \quad
\widehat{L}_\gamma(\theta) := P_n\left( \langle \theta, X \rangle Y \le \gamma \right),
\]
with corresponding losses $\ell_\gamma(\theta; (x, y)) = 1\{\langle \theta, x \rangle y \le \gamma\}$. Consider the following scenario: the data $x$ lie in a ball of radius $b$, so that $\|x\|_2 \le b$; note that the losses $\ell_\gamma$ and $\ell_0$ satisfy the Bernstein (5.2.3) and self-bounding (5.2.5) conditions with constant 1, as they take values in $\{0, 1\}$. We then have the following proposition.
Proposition 5.2.7. Let the above conditions on the data $(x, y)$ hold and let the margin $\gamma > 0$ and radius $r < \infty$. Then with probability at least $1 - \delta$,
\[
P\left( \langle \theta, X \rangle Y \le 0 \right)
\le \left( 1 + \frac{1}{n} \right) P_n\left( \langle \theta, X \rangle Y \le \gamma \right)
+ 8 \frac{r b \sqrt{\log \frac{n}{\delta}}}{\gamma \sqrt{n}} \sqrt{P_n\left( \langle \theta, X \rangle Y \le \gamma \right)}
+ C \frac{r^2 b^2 \log \frac{n}{\delta}}{\gamma^2 n}
\]
simultaneously for all $\|\theta\|_2 \le r$, where $C$ is a numerical constant independent of the problem parameters.

Proposition 5.2.7 provides a "dimension-free" guarantee—it depends only on the $\ell_2$-norms $\|\theta\|_2$ and $\|x\|_2$—so that it can apply equally in infinite dimensional spaces. The key to the inequality is that if we can find a large margin predictor—for example, one achieved by a support vector machine or, more broadly, by minimizing a convex loss of the form
\[
\mathop{\rm minimize}_{\|\theta\|_2 \le r} ~ \frac{1}{n} \sum_{i=1}^n \phi\left( \langle X_i, \theta \rangle Y_i \right)
\]
for some decreasing convex $\phi : \mathbb{R} \to \mathbb{R}_+$, e.g.\ $\phi(t) = \left[ 1 - t \right]_+$ or $\phi(t) = \log(1 + e^{-t})$—then we get strong generalization performance guarantees relative to the empirical margin $\gamma$. As one particular instantiation of this approach, suppose we can obtain a perfect classifier with positive margin: a vector $\theta$ with $\|\theta\|_2 \le r$ such that $\langle \theta, X_i \rangle Y_i \ge \gamma$ for each $i = 1, \ldots, n$. Then Proposition 5.2.7 guarantees that
\[
P\left( \langle \theta, X \rangle Y \le 0 \right) \le C \frac{r^2 b^2 \log \frac{n}{\delta}}{\gamma^2 n}
\]
with probability at least $1 - \delta$.
Proof Let $\pi_0$ be $\mathsf{N}(0, \tau^2 I)$ for some $\tau > 0$ to be chosen, and let $\pi$ be $\mathsf{N}(\widehat{\theta}, \tau^2 I)$ for some $\widehat{\theta} \in \mathbb{R}^d$ satisfying $\|\widehat{\theta}\|_2 \le r$. Then Corollary 5.2.6 implies that
\begin{align*}
\mathbb{E}_\pi[L_\gamma(\theta)]
& \le \mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]
+ 2 \sqrt{\frac{\mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta} \right]}
+ \frac{1}{n} \mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]
+ \frac{C}{n} \left[ D_{\rm kl}(\pi \| \pi_0) + \log \frac{n}{\delta} \right] \\
& \le \mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]
+ 2 \sqrt{\frac{\mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]}{n} \left[ \frac{r^2}{2 \tau^2} + \log \frac{n}{\delta} \right]}
+ \frac{1}{n} \mathbb{E}_\pi[\widehat{L}_\gamma(\theta)]
+ \frac{C}{n} \left[ \frac{r^2}{2 \tau^2} + \log \frac{n}{\delta} \right].
\end{align*}
Let us use the margin assumption. Note that if $Z \sim \mathsf{N}(0, \tau^2 I)$, then for any fixed $\theta_0, x, y$ we have
\[
\ell_0(\theta_0; (x, y)) - \mathbb{P}(Z^\top x \ge \gamma)
\le \mathbb{E}[\ell_\gamma(\theta_0 + Z; (x, y))]
\le \ell_{2\gamma}(\theta_0; (x, y)) + \mathbb{P}(Z^\top x \ge \gamma),
\]
where the middle expectation is over $Z \sim \mathsf{N}(0, \tau^2 I)$. Using the $\tau^2 \|x\|_2^2$-sub-Gaussianity of $Z^\top x$, we immediately obtain that if $\|x\|_2 \le b$, we have
\[
\ell_0(\theta_0; (x, y)) - \exp\left( - \frac{\gamma^2}{2 \tau^2 b^2} \right)
\le \mathbb{E}[\ell_\gamma(\theta_0 + Z; (x, y))]
\le \ell_{2\gamma}(\theta_0; (x, y)) + \exp\left( - \frac{\gamma^2}{2 \tau^2 b^2} \right).
\]
Returning to our earlier bound, we evidently have that if $\|x\|_2 \le b$ for all $x \in \mathcal{X}$, then with probability at least $1 - \delta$, simultaneously for all $\theta \in \mathbb{R}^d$ with $\|\theta\|_2 \le r$,
\begin{align*}
L_0(\theta)
& \le \widehat{L}_{2\gamma}(\theta) + 2 \exp\left( - \frac{\gamma^2}{2 \tau^2 b^2} \right)
+ 2 \sqrt{\frac{\widehat{L}_{2\gamma}(\theta) + \exp(- \frac{\gamma^2}{2 \tau^2 b^2})}{n} \left[ \frac{r^2}{2 \tau^2} + \log \frac{n}{\delta} \right]} \\
& \qquad + \frac{1}{n} \left[ \widehat{L}_{2\gamma}(\theta) + \exp\left( - \frac{\gamma^2}{2 \tau^2 b^2} \right) \right]
+ \frac{C}{n} \left[ \frac{r^2}{2 \tau^2} + \log \frac{n}{\delta} \right].
\end{align*}
Setting $\tau^2 = \frac{\gamma^2}{2 b^2 \log n}$, we immediately see that for any choice of margin $\gamma > 0$, we have with probability at least $1 - \delta$ that
\begin{align*}
L_0(\theta)
& \le \widehat{L}_{2\gamma}(\theta) + \frac{2}{n}
+ 2 \sqrt{\frac{1}{n} \left[ \widehat{L}_{2\gamma}(\theta) + \frac{1}{n} \right] \left[ \frac{r^2 b^2 \log n}{\gamma^2} + \log \frac{n}{\delta} \right]} \\
& \qquad + \frac{1}{n} \left[ \widehat{L}_{2\gamma}(\theta) + \frac{1}{n} \right]
+ \frac{C}{n} \left[ \frac{r^2 b^2 \log n}{\gamma^2} + \log \frac{n}{\delta} \right].
\end{align*}
\[
\mathbb{E}\left[ (P_n F - P F)^2 \mid X_1^n \right]
\le \frac{8 \sigma^2}{3 n} \left[ D_{\rm kl}\left( \pi(\cdot \mid X_1^n) \,\|\, \pi_0 \right) + \log \frac{2}{\delta} \right],
\]
where the expectation is taken over $F \sim \pi(\cdot \mid X_1^n)$, leaving the sample fixed. Now, consider choosing $\pi_0$ to be the average over all samples $X_1^n$ of $\pi$, that is, $\pi_0(\cdot) = \mathbb{E}_P[\pi(\cdot \mid X_1^n)]$, the expectation taken over $X_1^n \stackrel{\rm iid}{\sim} P$. Then by definition of mutual information,
\[
\mathbb{E}\left[ D_{\rm kl}\left( \pi(\cdot \mid X_1^n) \,\|\, \pi_0 \right) \right] = I(F; X_1^n).
\]
Corollary 5.2.8. Let $F$ be chosen according to any distribution $\pi(\cdot \mid X_1^n)$ conditional on the sample $X_1^n$. Then with probability at least $1 - \delta_0 - \delta_1$ over the sample $X_1^n \stackrel{\rm iid}{\sim} P$,
\[
\mathbb{E}\left[ (P_n F - P F)^2 \mid X_1^n \right]
\le \frac{8 \sigma^2}{3 n} \left[ \frac{I(F; X_1^n)}{\delta_0} + \log \frac{2}{\delta_1} \right].
\]
This corollary shows that if we have any procedure—say, a learning procedure or otherwise—
that limits the information between a sample X1n and an output F , then we are guaranteed that
$F$ generalizes. Tighter analyses are possible, though they are not our focus here; already, however, there should be an inkling that limiting the information between input samples and outputs may be fruitful.
information about the sample. The starting point of our approach is similar to our analysis of
PAC-Bayesian learning and generalization: we observe that if the function we decide to compute
on the data X1n is chosen without much information about the data at hand, then its value on the
sample should be similar to its values on the full population. This insight dovetails with what we
have seen thus far, that appropriate “stability” in information can be useful and guarantee good
future performance.
by Corollary 4.1.10 (sub-Gaussian concentration) and a union bound. Thus, so long as |Φ| is not
exponential in the sample size n, we expect uniformly high accuracy.
Example 5.3.1 (Risk minimization via statistical queries): Suppose that we are in the loss-
minimization setting (4.4.2), where the losses ℓ(θ, X_i) are convex and differentiable in θ. Then
gradient descent applied to L̂_n(θ) = P_n ℓ(θ, X) will converge to a minimizing value of L̂_n. We
can evidently implement gradient descent by a sequence of statistical queries φ(x) = ∇_θ ℓ(θ, x),
iterating
\[
\theta^{(k+1)} = \theta^{(k)} - \alpha_k P_n\phi^{(k)}, \qquad (5.3.2)
\]
where φ^{(k)} = ∇_θ ℓ(θ^{(k)}, x) and α_k is a stepsize. ◇
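A minimal sketch of the iteration (5.3.2) follows; the logistic loss and the synthetic data are illustrative assumptions, and the point is only that the optimizer touches the data solely through the empirical averages P_n φ.

```python
# Gradient descent implemented through statistical queries: each step asks for
# the empirical average of the query phi(x) = grad_theta loss(theta, x).
import numpy as np

def statistical_query(phi, sample):
    """Answer the query phi with its empirical average P_n phi."""
    return np.mean([phi(x) for x in sample], axis=0)

def logistic_grad(theta, x):
    # loss(theta; (a, y)) = log(1 + exp(-y <a, theta>)) (an assumed example loss)
    a, y = x
    return -y * a / (1.0 + np.exp(np.clip(y * (a @ theta), -30, 30)))

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
w = rng.normal(size=d)
Y = np.sign(A @ w)
data = list(zip(A, Y))

theta = np.zeros(d)
for k in range(100):
    alpha_k = 1.0 / (k + 1)                                   # stepsize
    g = statistical_query(lambda x: logistic_grad(theta, x), data)
    theta = theta - alpha_k * g                               # theta^{(k+1)} = theta^{(k)} - alpha_k P_n phi^{(k)}
```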
One issue with the example (5.3.1) is that we are interacting with the dataset, because each
sequential query φ(k) depends on the previous k − 1 queries. (Our results on uniform convergence
of empirical functionals and related ideas address many of these challenges, so that the result of
the process (5.3.2) will be well-behaved regardless of the interactivity.)
We consider an interactive version of the statistical query estimation problem. In this version,
there are two parties: an analyst (or statistician or learner), who issues queries φ : X → R, and
a mechanism that answers the queries to the analyst. We index our functionals φ by t ∈ T for a
(possibly infinite) set T , so we have a collection {φt }t∈T . In this context, we thus have the following
scheme:
Figure 5.1: the interactive statistical query scheme. Input: sample X_1^n drawn i.i.d. P, collection {φ_t}_{t∈T} of possible queries. Repeat: for k = 1, 2, . . ., the analyst chooses an index T_k ∈ T and the mechanism responds with an approximation A_k of P φ_{T_k}.
Of interest in the iteration of Figure 5.1 is that we interactively choose T_1, T_2, . . . , T_k, where the choice T_i
may depend on our approximations of EP [φTj (X)] for j < i, that is, on the results of our previous
queries. Even more broadly, the analyst may be able to choose the index Tk in alternative ways
depending on the sample X1n , and our goal is to still be able to accurately compute expectations
P φT = EP [φT (X)] when the index T may depend on X1n . The setting in Figure 5.1 clearly breaks
with the classical statistical setting in which an analysis is pre-specified before collecting data, but
more closely captures modern data exploration practices.
Theorem 5.3.2. Let {φt }t∈T be a collection of σ 2 -sub-Gaussian functions φt : X → R. Then for
any random variable T and any λ > 0,
\[
E[(P_n\phi_T - P\phi_T)^2] \le \frac{1}{\lambda}\Big[I(X_1^n; T) - \frac{1}{2}\log\big[1 - 2\lambda\sigma^2/n\big]_+\Big]
\]
and
\[
\big|E[P_n\phi_T] - E[P\phi_T]\big| \le \sqrt{\frac{2\sigma^2 I(X_1^n; T)}{n}},
\]
where the expectations are taken over T and the sample X_1^n.
Proof   The proof is similar to that of our first basic PAC-Bayes result in Theorem 5.2.1. Let
us assume w.l.o.g. that Pφ_t = 0 for all t ∈ T, noting that then P_nφ_t is σ²/n-sub-Gaussian. We
prove the first result first. Lemma 5.2.2 implies that E[\exp(\lambda(P_n\phi_t)^2)] \le [1 - 2\lambda\sigma^2/n]_+^{-1/2} for each
inequality (iii) is Lemma 5.2.2.) Now, let π0 be the marginal distribution on T (marginally over
all observations X1n ), and let π denote the posterior of T conditional on the sample X1n . Then
E[Dkl (π||π0 )] = I(X1n ; T ) by definition of the mutual information, giving the bound on the squared
error.
For the second result, note that the Donsker-Varadhan equality implies
\[
\lambda E\Big[\int P_n\phi_t\,d\pi(t)\Big] \le E[D_{\rm kl}(\pi\|\pi_0)] + \log\int E[\exp(\lambda P_n\phi_t)]\,d\pi_0(t) \le I(X_1^n; T) + \frac{\lambda^2\sigma^2}{2n}.
\]
Dividing both sides by λ gives E[P_n\phi_T] \le \sqrt{2\sigma^2 I(X_1^n; T)/n}, and performing the same analysis with
−φ_T gives the second result of the theorem.
The key in the theorem is that if the mutual information—the Shannon information—I(X; T)
between the sample X and T is small, then the expected squared error can be small. To make this
a bit clearer, let us choose values for λ in the theorem; taking λ = \frac{n}{2e\sigma^2} gives the following corollary.
for a numerical constant C. That is, powers of sub-Gaussian maxima grow at most logarith-
mically. Indeed, by Theorem 4.1.11, we have for any q ≥ 1 by Hölder’s inequality that
\[
E\Big[\max_j |Z_j|^p\Big] \le E\Big[\sum_j |Z_j|^{pq}\Big]^{1/q} \le k^{1/q}\big(Cpq\sigma^2\big)^{p/2},
\]
and setting q = log k gives the inequality. Thus, we see that for any a priori fixed v1 , . . . , vk , vk+1 ,
we have
\[
E\Big[\max_j \big(v_j^T(P_n YX)\big)^2\Big] \le O(1)\,\frac{\log k}{n}.
\]
If instead we allow a single interaction, the problem is different. We issue queries associated
with v = e1 , . . . , ek , the k standard basis vectors; then we simply set Vk+1 = Pn Y X/ kPn Y Xk2 .
Then evidently
\[
E\big[(V_{k+1}^T(P_n YX))^2\big] = E\big[\|P_n YX\|_2^2\big] = \frac{k}{n},
\]
which is exponentially larger than in the non-interactive case. That is, if an analyst is allowed
to interact with the dataset, he or she may be able to discover very large correlations that are
certainly false in the population, which in this case has P XY = 0. 3
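The following simulation sketch illustrates the gap. It assumes X ∼ N(0, I_k) and Y an independent uniform sign (so that P[XY] = 0), which is one distribution consistent with the example; the sample sizes are arbitrary.

```python
# Adaptive vs. fixed queries on a null dataset: the adaptively chosen direction
# V_{k+1} = P_n YX / ||P_n YX||_2 finds squared correlation about k/n, while
# any a priori fixed directions give only about (log k)/n.
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 100, 1000, 200
fixed, adaptive = [], []
for _ in range(trials):
    X = rng.normal(size=(n, k))
    Y = rng.choice([-1.0, 1.0], size=n)
    PnYX = (Y[:, None] * X).mean(axis=0)        # P_n YX
    fixed.append(np.max(PnYX ** 2))             # max over the fixed queries e_1, ..., e_k
    adaptive.append(np.sum(PnYX ** 2))          # (V_{k+1}^T P_n YX)^2 = ||P_n YX||_2^2
print(f"fixed queries   ~ {np.mean(fixed):.4f}   (order log(k)/n = {np.log(k)/n:.4f})")
print(f"adaptive query  ~ {np.mean(adaptive):.4f} (order k/n = {k/n:.4f})")
```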
Example 5.3.4 shows that, unless we are careful, substantial issues may arise in interactive data
analysis scenarios. When we consider our goal more broadly, which is to be able to provide
accurate approximations to P φ for queries φ chosen adaptively for any population distribution P
and φ : X → [−1, 1], it is possible to construct quite perverse situations, where if we compute
sample expectations Pn φ exactly, one round of interaction is sufficient to find a query φ for which
Pn φ − P φ ≥ 1.
Example 5.3.5 (Exact query answering allows arbitrary corruption): Suppose we draw a
iid
sample X1n of size n on a sample space X = [m] with Xi ∼ Uniform([m]), where m ≥ 2n. Let
Φ be the collection of all functions φ : [m] → [−1, 1], so that P(|Pn φ − P φ| ≥ t) ≤ exp(−nt2 /2)
for any fixed φ. Suppose that in the interactive scheme in Fig. 5.1, we simply release answers
A = Pn φ. Consider the following query:
More generally, when one performs an interactive data analysis (e.g. as in Fig. 5.1), adapting
hypotheses while interacting with a dataset, it is not a question of statistical significance or mul-
tiplicity control for the analysis one does, but for all the possible analyses one might have done
otherwise. Given the branching paths one might take in an analysis, it is clear that we require
some care.
With that in mind, we consider the desiderata for techniques we might use to control information
in the indices we select. We seek some type of stability in the information algorithms provide
to a data analyst—intuitively, if small changes to a sample do not change the behavior of an
analyst substantially, then we expect to obtain reasonable generalization bounds. If outputs of a
particular analysis procedure carry little information about a particular sample (but instead provide
information about a population), then Corollary 5.3.3 suggests that any estimates we obtain should
be accurate.
To develop this stability theory, we require two conditions: first, that whatever quantity we
develop for stability should compose adaptively, meaning that if we apply two (randomized) algo-
rithms to a sample, then if both are appropriately stable, even if we choose the second algorithm
because of the output of the first in arbitrary ways, they should remain jointly stable. Second, our
notion should bound the mutual information I(X1n ; T ) between the sample X1n and T . Lastly, we
remark that this control on the mutual information has an additional benefit: by the data process-
ing inequality, any downstream analysis we perform that depends only on T necessarily satisfies the
same stability and information guarantees as T , because if we have the Markov chain X1n → T → V
then I(X1n ; V ) ≤ I(X1n ; T ).
We consider randomized algorithms A : X n → A, taking values in our index set A, where
A(X1n ) ∈ A is a random variable that depends on the sample X1n . For simplicity in derivation,
we abuse notation in this section, and for random variables X and Y with distributions P and Q
respectively, we denote
Dkl (X||Y ) := Dkl (P ||Q) .
We then ask for a type of leave-one-out stability for the algorithms A, where A is insensitive to the
changes of a single example (on average).
\[
\frac{1}{n}\sum_{i=1}^n D_{\rm kl}\big(A(x_1^n)\,\|\,A(x_{\setminus i})\big) = \frac{1}{n}\sum_{i=1}^n \frac{x_i^2}{2\sigma^2 n^2} \le \frac{1}{2\sigma^2 n^2},
\]
so that the sample mean of a bounded random variable perturbed with Gaussian noise is
ε = \frac{1}{2\sigma^2 n^2}-KL-stable. ◇
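A quick numerical check of this calculation is sketched below. It assumes x_i ∈ [0, 1] and the leave-one-out convention suggested by the displayed identity (the ith observation's contribution to the sum is dropped while the 1/n normalization is kept), so the two released Gaussians have means differing by x_i/n.

```python
# Numerical check: average leave-one-out KL divergence for the Gaussian-noise
# mean release, compared against the bound 1/(2 sigma^2 n^2).
import numpy as np

def kl_gauss(m1, m2, var):
    # KL(N(m1, var) || N(m2, var)) = (m1 - m2)^2 / (2 var)
    return (m1 - m2) ** 2 / (2 * var)

rng = np.random.default_rng(0)
n, sigma2 = 200, 0.5
x = rng.uniform(size=n)
full_mean = x.mean()
kls = [kl_gauss(full_mean, np.delete(x, i).sum() / n, sigma2) for i in range(n)]
print(f"average KL {np.mean(kls):.2e}   bound 1/(2 sigma^2 n^2) = {1/(2*sigma2*n**2):.2e}")
```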
Example 5.3.7 (KL-stability in mean estimation: Laplace noise addition): Let the conditions
of Example 2.1.7 hold, but suppose instead of Gaussian noise we add scaled Laplace noise,
We require a bit of notational trickery now. Fixing i, let PA,A0 be the joint distribution of
A0 (A(xn1 ), xn1 ) and A(xn1 ) and QA,A0 the joint distribution of A0i (Ai (x\i ), x\i ) and Ai (x\i ), so that
they are both distributions over A1 × A0 . Let PA0 |a be the distribution of A0 (t, xn1 ) and similarly
QA0 |a is the distribution of A0i (t, x\i ). Note that A0 , A0i both “observe” x, so that using the chain
rule (2.1.6) for KL-divergences, we have
\[
D_{\rm kl}\big((A'\circ A, A)\,\|\,(A_i'\circ A_i, A_i)\big) = D_{\rm kl}\big(P_{A,A'}\,\|\,Q_{A,A'}\big)
= D_{\rm kl}(P_A\|Q_A) + \int D_{\rm kl}\big(P_{A'|t}\,\|\,Q_{A'|t}\big)\,dP_A(t),
\]
as desired.
The second key result is that KL-stable algorithms also bound the mutual information of a
random function.
Lemma 5.3.9. Let Xi be independent. Then for any random variable A,
\[
I(A; X_1^n) \le \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}) = \sum_{i=1}^n \int D_{\rm kl}\big(A(x_1^n)\,\|\,A_i(x_{\setminus i})\big)\,dP(x_1^n),
\]
Proof Without loss of generality, we assume A and X are both discrete. In this case, we have
\[
I(A; X_1^n) = \sum_{i=1}^n I(A; X_i \mid X_1^{i-1}) = \sum_{i=1}^n \Big[H(X_i \mid X_1^{i-1}) - H(X_i \mid A, X_1^{i-1})\Big].
\]
Now, because the Xi follow a product distribution, H(Xi | X1i−1 ) = H(Xi ), while H(Xi |
A, X1i−1 ) ≥ H(Xi | A, X\i ) because conditioning reduces entropy. Consequently, we have
\[
I(A; X_1^n) \le \sum_{i=1}^n \Big[H(X_i) - H(X_i \mid A, X_{\setminus i})\Big] = \sum_{i=1}^n I(A; X_i \mid X_{\setminus i}).
\]
To see the final equality, note that
\[
\begin{aligned}
I(A; X_i \mid X_{\setminus i})
&= \int_{\mathcal{X}^{n-1}} I(A; X_i \mid X_{\setminus i} = x_{\setminus i})\,dP(x_{\setminus i}) \\
&= \int_{\mathcal{X}^{n-1}}\int_{\mathcal{X}} D_{\rm kl}\big(A(x_1^n)\,\|\,A(x_{1:i-1}, X_i, x_{i+1:n})\big)\,dP(x_i)\,dP(x_{\setminus i})
\end{aligned}
\]
by definition of mutual information as I(X; Y) = E_X[D_{\rm kl}(P_{Y|X}\|P_Y)].
Combining Lemmas 5.3.8 and 5.3.9, we see (nearly) immediately that KL stability implies
a mutual information bound, and consequently even interactive KL-stable algorithms maintain
bounds on mutual information.
Proposition 5.3.10. Let A1 , . . . , Ak be εi -KL-stable procedures, respectively, composed in any
arbitrary sequence. Let Xi be independent. Then
\[
\frac{1}{n}\,I(A_1, \ldots, A_k; X_1^n) \le \sum_{i=1}^k \varepsilon_i.
\]
Proof Applying Lemma 5.3.9,
\[
I(A_1^k; X_1^n) \le \sum_{i=1}^n I(A_1^k; X_i \mid X_{\setminus i}) = \sum_{j=1}^k\sum_{i=1}^n I(A_j; X_i \mid X_{\setminus i}, A_1^{j-1}),
\]
where A(a_0, x_1^n) is the (random) procedure A on inputs x_1^n and a_0, while A(a_0, x_{\setminus i}) denotes the
(random) procedure A on input a_0, x_{\setminus i}, X_i, and where the ith example X_i follows its distribution
conditional on A_0 = a_0 and X_{\setminus i} = x_{\setminus i}, as in Lemma 5.3.9. We then recognize that for each i, we
have
\[
\int D_{\rm kl}\big(A(a_0, x_1^n)\,\|\,A(a_0, x_{\setminus i})\big)\,dP(x_i \mid a_0, x_{\setminus i})
\le \int D_{\rm kl}\big(A(a_0, x_1^n)\,\|\,\tilde{A}(a_0, x_{\setminus i})\big)\,dP(x_i \mid a_0, x_{\setminus i})
\]
for any randomized function A, e as the marginal A in the lemma minimizes the average KL-
divergence (recall Exercise 2.15). Now, sum over i and apply the definition of KL-stability as
in Lemma 5.3.8.
Figure 5.2: the Gaussian-noise query-answering procedure. Input: sample X_1^n ∈ X^n drawn i.i.d. P, collection {φ_t}_{t∈T} of possible queries φ_t : X → [−1, 1]. Repeat: for k = 1, 2, . . .
This procedure is evidently KL-stable, and based on Example 5.3.6 and Proposition 5.3.10, we
have that
\[
\frac{1}{n}\,I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{k}{2\sigma^2 n^2}
\]
so long as the indices T_i ∈ T are chosen only as functions of the answers P_n\phi_{T_j} + Z_j for j < i, as the classical
information processing inequality implies that
\[
\frac{1}{n}\,I(X_1^n; T_1, \ldots, T_k, T_{k+1}) \le \frac{1}{n}\,I(X_1^n; A_1, \ldots, A_k)
\]
because we have X1n → A1 → T2 and so on for the remaining indices. With this, we obtain the
following theorem.
Theorem 5.3.11. Let the indices Ti , i = 1, . . . , k + 1 be chosen in an arbitrary way using the
procedure 5.2, and let σ 2 > 0. Then
\[
E\Big[\max_{j\le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{2ek}{\sigma^2 n^2} + \frac{10}{4n} + 4\sigma^2(\log k + 1).
\]
By inspection, we can optimize over σ² by setting σ² = \sqrt{k/(\log k + 1)}/n, which yields the
upper bound
\[
E\Big[\max_{j\le k}(A_j - P\phi_{T_j})^2\Big] \le \frac{10}{4n} + 10\,\frac{\sqrt{k(1 + \log k)}}{n}.
\]
Comparing to Example 5.3.4, we see a substantial improvement. While we do not achieve accuracy
scaling with log k, as we would if the queried functionals φ_t were completely independent of the
sample, we see that we achieve mean-squared error of order
\[
\frac{\sqrt{k\log k}}{n}.
\]
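A minimal sketch of the mechanism behind Theorem 5.3.11 follows: each query φ_t : X → [−1, 1] is answered by P_n φ plus Gaussian noise with the variance choice σ² = √(k/(log k + 1))/n discussed above. The analyst callback, the data, and the specific queries are illustrative assumptions.

```python
# Gaussian-noise answers to (possibly adaptively chosen) bounded queries.
import numpy as np

class NoisyQueryMechanism:
    def __init__(self, sample, k, seed=0):
        self.sample = np.asarray(sample)
        self.rng = np.random.default_rng(seed)
        n = len(self.sample)
        self.sigma2 = np.sqrt(k / (np.log(k) + 1)) / n   # variance choice from the text

    def answer(self, phi):
        """Return A_k = P_n phi + Z_k with Z_k ~ N(0, sigma^2)."""
        values = np.clip(phi(self.sample), -1.0, 1.0)    # enforce phi : X -> [-1, 1]
        return values.mean() + self.rng.normal(scale=np.sqrt(self.sigma2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=1000)
k = 50
mech = NoisyQueryMechanism(X, k)
answers = [mech.answer(lambda x, j=j: np.cos(j * x)) for j in range(1, k + 1)]
```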
with probability at least 1 − δ suggest that we can optimize them by choosing π carefully. For
example, in the context of learning a statistical model parameterized by θ ∈ Θ with losses `(θ; x, y),
it is natural to attempt to find π minimizing
\[
E_\pi[P_n \ell(\theta; X, Y) \mid P_n] + C\sqrt{\frac{D_{\rm kl}(\pi\|\pi_0)}{n}}
\]
in π, where the expectation is taken over θ ∼ π. If this quantity has optimal value \star_n, then one is
immediately guaranteed that for the population P, we have E_\pi[P\ell(\theta; X, Y)] \le \star_n + C\sqrt{\log\frac{1}{\delta}}/\sqrt{n}.
Langford and Caruana [126] take this approach, and Dziugaite and Roy [79] use it to give (the
first) non-trivial bounds for deep learning models.
The questions of interactive data analysis date back at least several decades, perhaps most pro-
foundly highlighted positively by Tukey's Exploratory Data Analysis [168]. Problems of scientific
replicability have, conversely, highlighted many of the challenges of reusing data or peeking, even
innocently, at samples before performing statistical analyses [113, 86, 91]. Our approach to for-
malizing these ideas, and to making rigorous the goal of limiting information leakage, draws from a more recent
strain of work in the theoretical computer science literature, with major contributions from Dwork,
Feldman, Hardt, Pitassi, Reingold, and Roth and Bassily, Nissim, Smith, Steinke, Stemmer, and
Ullman [78, 76, 77, 20, 21]. Our particular treatment most closely follows Feldman and Steinke [82].
The problems these techniques target also arise frequently in high-dimensional statistics, where one
often wishes to estimate uncertainty and perform inference after selecting a model. While we do
not touch on these problems, a few references in this direction include [25, 166, 109].
5.5 Exercises
Exercise 5.1 (Duality in Donsker-Varadhan): Here, we give a converse result to Theorem 5.1.1,
showing that for any function h : X → R,
where the supremum is taken over probability measures. If Q has a density, the supremum may be
taken over probability measures having a density.
(a) Show the equality (5.5.1) in the case that X is discrete by directly computing the supremum.
(That is, let |X | = k, and identify probability measures P and Q with vectors p, q ∈ Rk+ .)
(b) Let Q have density q. Assume that EQ [eh(X) ] < ∞ and let
(c) If EQ [eh(X) ] = +∞, then monotone convergence implies that limB↑∞ EQ [emin{B,h(X)} ] = +∞.
Conclude (5.5.1).
Exercise 5.2 (An alternative PAC-Bayes bound): Let f : Θ × X → R, and let π_0 be a density
on θ ∈ Θ. Use the dual form (5.5.1) of the variational representation of the KL-divergence to show
that with probability at least 1 − δ over the draw of X_1^n drawn i.i.d. from P,
The function ψ(t) = min{1, max{−1, t}} (the truncation of t to the range [−1, 1]) is such a function.
Let πθ be the normal distribution N(θ, σ 2 I) and π0 be N(0, σ 2 I).
(a) Let λ > 0. Use Exercise 5.2 to show that with probability at least 1 − δ, for all θ ∈ R^d
\[
\frac{1}{\lambda}\int P_n\psi(\lambda\langle\theta', X\rangle)\,\pi_\theta(\theta')\,d\theta'
\le \langle\theta, E[X]\rangle + \lambda\big(\theta^\top\Sigma\theta + \sigma^2\,\mathrm{tr}(\Sigma)\big)
+ \frac{\|\theta\|_2^2/2\sigma^2 + \log\frac{1}{\delta}}{n\lambda}.
\]
Exercise 5.4 (Large-margin PAC-Bayes bounds for multiclass problems): Consider the following
multiclass prediction scenario. Data comes in pairs (x, y) ∈ bBd2 × [k] where Bd2 = {v ∈ Rd | kvk2 ≤
1} denotes the `2 -ball and [k] = {1, . . . , k}. We make predictions using predictors θ1 , . . . , θk ∈ Rd ,
where the prediction of y on an example x is
We suffer an error whenever yb(x) 6= y, and the margin of our classifier on pair (x, y) is
If hθy , xi > hθi , xi for all i 6= y, the margin is then positive (and the prediction is correct).
(a) Develop an analogue of the bounds in Section 5.2.2 in this k-class multiclass setting. To do
so, you should (i) define the analogue of the margin-based loss `γ , (ii) show how Gaussian
perturbations leave it similar, and (iii) prove an analogue of the bound in Section 5.2.2. You
should assume one of the two conditions
\[
\text{(C1)}\ \ \|\theta_i\|_2 \le r \text{ for all } i
\qquad\qquad
\text{(C2)}\ \ \sum_{i=1}^k \|\theta_i\|_2^2 \le k r^2
\]
(b) Describe a minimization procedure—just a few lines suffice—that uses convex optimization to
find a (reasonably) large-margin multiclass classifier.
Exercise 5.5 (A variance-based information bound): Let Φ = {φ_t}_{t∈T} be a collection of functions
φ_t : X → R, where each φ_t satisfies the Bernstein condition (4.1.7) with parameters σ²(φ_t) and b,
that is, |E[(φ_t(X) − Pφ_t(X))^k]| ≤ \frac{k!}{2}\sigma^2(\phi_t)b^{k-2} for all k ≥ 3, and Var(φ_t(X)) = σ²(φ_t). Let T ∈ T
be any random variable, which may depend on an observed sample X_1^n. Show that for all C > 0
and |λ| ≤ \frac{C}{2b},
\[
E\bigg[\frac{P_n\phi_T - P\phi_T}{\max\{C, \sigma(\phi_T)\}}\bigg] \le \frac{1}{n|\lambda|} I(T; X_1^n) + |\lambda|.
\]
Exercise 5.6 (An information bound on variance): Let Φ = {φt }t∈T be a collection of functions
φt : X → R, where each φt : X → [−1, 1]. Let σ 2 (φt ) = Var(φt (X)). Let s2n (φ) = Pn φ2 − (Pn φ)2 be
the sample variance of φ. Show that for all C > 0 and 0 ≤ λ ≤ C/4,
\[
E\bigg[\frac{s_n^2(\phi_T)}{\max\{C, \sigma^2(\phi_T)\}}\bigg] \le \frac{1}{n\lambda} I(T; X_1^n) + 2.
\]
The max{C, σ 2 (φT )} term is there to help avoid division by 0. Hint: If 0 ≤ x ≤ 1, then
ex ≤ 1 + 2x, and if X ∈ [0, 1], then E[eX ] ≤ 1 + 2E[X] ≤ e2E[X] . Use this to argue that
E[e^{\lambda n P_n(\phi - P\phi)^2/\max\{C, \sigma^2\}}] \le e^{2\lambda n} for any φ : X → [−1, 1] with Var(φ) ≤ σ², then apply the
Donsker-Varadhan theorem.
Exercise 5.7: Consider the following scenario: let φ : X → [−1, 1] and let α > 0, τ > 0. Let
µ = P_nφ and s² = P_nφ² − µ². Define σ² = max{αs², τ²}, and assume that τ² ≥ \frac{5\alpha}{n}.
(b) Show that if α² ≤ C'τ² for a numerical constant C' < ∞, then we can take ε ≤ O(1)\frac{1}{n^2\alpha}.
    Hint: Use Exercise 2.14, and consider the "alternative" mechanisms of sampling from
\[
\mathsf{N}(\mu_{-i}, \sigma_{-i}^2) \quad\text{where}\quad \sigma_{-i}^2 = \max\{\alpha s_{-i}^2, \tau^2\}
\]
for
\[
\mu_{-i} = \frac{1}{n-1}\sum_{j\ne i}\phi(X_j) \quad\text{and}\quad s_{-i}^2 = \frac{1}{n-1}\sum_{j\ne i}\phi(X_j)^2 - \mu_{-i}^2.
\]
Input: Sample X1n ∈ X n drawn i.i.d. P , collection {φt }t∈T of possible queries φt : X →
[−1, 1], parameters α > 0 and τ > 0
Repeat: for k = 1, 2, . . .
iii. Mechanism draws independent Z_k ∼ N(0, σ_k²) and responds with answer
\[
A_k := P_n\phi + Z_k = \frac{1}{n}\sum_{i=1}^n \phi(X_i) + Z_k.
\]
Exercise 5.8 (A general variance-dependent bound on interactive queries): Consider the algo-
rithm in Fig. 5.3. Let σ 2 (φt ) = Var(φt (X)) be the variance of φt .
(a) Show that for b > 0 and for all 0 ≤ λ ≤ 2b ,
\[
E\bigg[\max_{j\le k}\frac{|A_j - P\phi_{T_j}|}{\max\{b, \sigma(\phi_{T_j})\}}\bigg]
\le \frac{1}{n\lambda} I(X_1^n; T_1^k) + \lambda
+ 2\sqrt{\log(ke)}\,\sqrt{\frac{2}{nb^2} I(X_1^n; T_1^k) + 2\alpha + \frac{4\alpha + \tau^2}{b^2}}.
\]
(If you do not have quite the right constants, that’s fine.)
(b) Using the result of Question 5.7, show that with appropriate choices for the parameters
α, b, τ 2 , λ that for a numerical constant C < ∞
" #
|Aj − P φTj | (k log k)1/4
E max √ ≤ C √ .
j≤k max{(k log k)1/4 / n, σ(φTj )} n
You may assume that k, n are large if necessary.
(c) Interpret the result from part (b). How does this improve over Theorem 5.3.11?
Chapter 6
defined whenever Z ≥ 0 with probability 1. As our particular focus throughout this chapter, we
consider the moment generating function and associated transformation X 7→ eλX . If we know the
moment generating function ϕX (λ) := E[eλX ], then ϕ0X (λ) = E[XeλX ], and so
This suggests—in a somewhat roundabout way that we make precise—that control of the entropy H(e^{λX})
should be sufficient for controlling the moment generating function of X.
The Herbst argument makes this rigorous.
Proposition 6.1.2. Let X be a random variable and assume that there exists a constant σ 2 < ∞
such that
\[
H(e^{\lambda X}) \le \frac{\lambda^2\sigma^2}{2}\,\varphi_X(\lambda) \qquad (6.1.3)
\]
for all λ ∈ R (respectively, λ ∈ R_+), where ϕ_X(λ) = E[e^{λX}] denotes the moment generating function
of X. Then
\[
E[\exp(\lambda(X - E[X]))] \le \exp\Big(\frac{\lambda^2\sigma^2}{2}\Big)
\]
for all λ ∈ R (respectively, λ ∈ R+ ).
Proof   Let ϕ = ϕ_X for shorthand. The proof proceeds by an integration argument, in which we
show that log ϕ(λ) ≤ λ²σ²/2. First, note that ϕ′(λ) = E[Xe^{λX}], so that
\[
\lambda\varphi'(\lambda) - \varphi(\lambda)\log\varphi(\lambda) = H(e^{\lambda X}) \le \frac{\lambda^2\sigma^2}{2}\varphi(\lambda),
\]
and dividing both sides by λ²ϕ(λ) yields the equivalent statement
\[
\frac{\varphi'(\lambda)}{\lambda\varphi(\lambda)} - \frac{1}{\lambda^2}\log\varphi(\lambda) \le \frac{\sigma^2}{2}.
\]
The left side is the derivative
\[
\frac{\partial}{\partial\lambda}\Big[\frac{1}{\lambda}\log\varphi(\lambda)\Big] = \frac{\varphi'(\lambda)}{\lambda\varphi(\lambda)} - \frac{1}{\lambda^2}\log\varphi(\lambda).
\]
It is possible to give a similar argument for sub-exponential random variables, which allows us
to derive Bernstein-type bounds, of the form of Corollary 4.1.18, but using the entropy method. In
particular, in the exercises, we show the following result.
Proposition 6.1.3. Assume that there exist positive constants b and σ such that
be the collection of all variables except Xi . Our first result is a consequence of the chain rule for
entropy and is known as Han’s inequality.
Proof The proof is a consequence of the chain rule for entropy and that conditioning reduces
entropy. We have
We also require a divergence version of Han’s inequality, which will allow us to relate the entropy
H of a random variable to divergences and other information-theoretic quantities. Let X be an
arbitrary space, and let Q be a distribution over X n and P = P1 ×· · ·×Pn be a product distribution
on the same space. For A ⊂ X n−1 , define the marginal densities
Proof We have seen earlier in the notes (recall the definition (2.2.1) of the KL divergence as
a supremum over all quantizers and the surrounding discussion) that it is no loss of generality to
assume that X is discrete. Thus, noting that the probability mass functions
\[
q^{(i)}(x_{\setminus i}) = \sum_x q(x_1^{i-1}, x, x_{i+1}^n)
\quad\text{and}\quad
p^{(i)}(x_{\setminus i}) = \prod_{j\ne i} p_j(x_j),
\]
Now, by subtracting q(xn1 ) log p(xn1 ) from both sides of the preceding display, we obtain
\[
\begin{aligned}
(n-1) D_{\rm kl}(Q\|P)
&= (n-1)\sum_{x_1^n} q(x_1^n)\log q(x_1^n) - (n-1)\sum_{x_1^n} q(x_1^n)\log p(x_1^n) \\
&\ge \sum_{i=1}^n \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i})\log q^{(i)}(x_{\setminus i}) - (n-1)\sum_{x_1^n} q(x_1^n)\log p(x_1^n).
\end{aligned}
\]
We expand the final term. Indeed, by the product nature of the distributions p, we have
\[
\begin{aligned}
(n-1)\sum_{x_1^n} q(x_1^n)\log p(x_1^n)
&= (n-1)\sum_{x_1^n} q(x_1^n)\sum_{i=1}^n \log p_i(x_i) \\
&= \sum_{i=1}^n \sum_{x_1^n} q(x_1^n)\underbrace{\sum_{j\ne i}\log p_j(x_j)}_{=\log p^{(i)}(x_{\setminus i})}
= \sum_{i=1}^n \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i})\log p^{(i)}(x_{\setminus i}).
\end{aligned}
\]
Noting that
\[
\sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i})\log q^{(i)}(x_{\setminus i}) - \sum_{x_{\setminus i}} q^{(i)}(x_{\setminus i})\log p^{(i)}(x_{\setminus i}) = D_{\rm kl}\big(Q^{(i)}\|P^{(i)}\big)
\]
completes the proof.
Finally, we will prove the main result of this subsection: a tensorization identity for the entropy
H(Y ) for an arbitrary random variable Y that is a function of n independent random variables.
For this result, we use a technique known as tilting, in combination with the two variants of Han’s
inequality we have shown, to obtain the result. The tilting technique is one used to transform
problems of random variables into one of distributions, allowing us to bring the tools of information
and entropy to bear more directly. This technique is a common one, and used frequently in
large deviation theory and in statistics for heavy-tailed data, among other areas. More concretely, let
Y = f(X_1, . . . , X_n) for some non-negative function f. Then we may always define a tilted density
\[
q(x_1, \ldots, x_n) := \frac{f(x_1, \ldots, x_n)\,p(x_1, \ldots, x_n)}{E_P[f(X_1, \ldots, X_n)]}, \qquad (6.1.5)
\]
which, by inspection, satisfies ∫ q(x_1^n)\,dμ(x_1^n) = 1 and q ≥ 0. In our context, if f ≈ constant under the
distribution P, then we should have f(x_1^n)p(x_1^n) ≈ cp(x_1^n) and so D_{\rm kl}(Q‖P) should be small; we
can make this rigorous via the following tensorization theorem.
Proof   Inequality (6.1.6) holds for Y if and only if it holds identically for cY for any c > 0, so
we assume without loss of generality that E_P[Y] = 1. We thus obtain that H(Y) = E[Y log Y] =
E[φ(Y)], where we assign φ(t) = t log t. Let P have density p with respect to a base measure µ. Then
by defining the tilted distribution (density) q(xn1 ) = f (xn1 )p(xn1 ), we have Q(X n ) = 1, and moreover,
we have
\[
D_{\rm kl}(Q\|P) = \int q(x_1^n)\log\frac{q(x_1^n)}{p(x_1^n)}\,d\mu(x_1^n)
= \int f(x_1^n)p(x_1^n)\log f(x_1^n)\,d\mu(x_1^n) = E_P[Y\log Y] = H(Y).
\]
E[φ(Y )] − E[φ(E[Y | X\i ])] = E[E[φ(Y ) | X\i ] − φ(E[Y | X\i ])] = E[H(Y | X\i )].
Using Han’s inequality for relative entropies (Proposition 6.1.4) then immediately gives
\[
H(Y) = D_{\rm kl}(Q\|P) \le \sum_{i=1}^n\Big[D_{\rm kl}(Q\|P) - D_{\rm kl}\big(Q^{(i)}\|P^{(i)}\big)\Big] = \sum_{i=1}^n E[H(Y\mid X_{\setminus i})],
\]
Theorem 6.1.6 shows that if we can show that individually the conditional entropies H(Y | X\i )
are not too large, then the Herbst argument (Proposition 6.1.2 or its variant Proposition 6.1.3)
allows us to provide strong concentration inequalities for general random variables Y .
\[
\sup_{x\in\mathcal{X},\,x'\in\mathcal{X}} \big[f(x_1,\ldots,x_{i-1},x,x_{i+1},\ldots,x_n) - f(x_1,\ldots,x_{i-1},x',x_{i+1},\ldots,x_n)\big] \le c_i \quad\text{for all } x_{\setminus i}. \qquad (6.1.7)
\]
Then we have the following result.
Proposition 6.1.7 (Bounded differences). Assume that f satisfies the bounded differences condition (6.1.7), where \frac{1}{4}\sum_{i=1}^n c_i^2 \le \sigma^2. Let X_i be independent. Then Y = f(X_1, . . . , X_n) is σ²-sub-Gaussian.
Proof We use a similar integration argument to the Herbst argument of Proposition 6.1.2, and
we apply the tensorization inequality (6.1.6). First, let U be an arbitrary random variable taking
values in [a, b]. We claim that if ϕU (λ) = E[eλU ] and ψ(λ) = log ϕU (λ) is its cumulant generating
function, then
\[
\frac{H(e^{\lambda U})}{E[e^{\lambda U}]} \le \frac{\lambda^2(b-a)^2}{8}. \qquad (6.1.8)
\]
Indeed, we have that Y = \|\sum_{i=1}^n X_i\|_2 satisfies the bounded differences inequality with parameters c_i, and so
\[
P\Big(\Big\|\sum_{i=1}^n X_i\Big\|_2 \ge t\Big)
= P\Big(\Big\|\sum_{i=1}^n X_i\Big\|_2 - E\Big\|\sum_{i=1}^n X_i\Big\|_2 \ge t - E\Big\|\sum_{i=1}^n X_i\Big\|_2\Big)
\le \exp\bigg(-2\,\frac{\big[t - E\|\sum_{i=1}^n X_i\|_2\big]_+^2}{\sum_{i=1}^n c_i^2}\bigg).
\]
Noting that E[\|\sum_{i=1}^n X_i\|_2] \le \sqrt{E[\|\sum_{i=1}^n X_i\|_2^2]} = \sqrt{\sum_{i=1}^n E[\|X_i\|_2^2]} gives the result. ◇
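A small simulation of this example is sketched below; the uniform distribution on [−1, 1]^d is an illustrative assumption, and the tail bound printed is the (generally loose) bounded-differences bound with c_i = 2√d.

```python
# Concentration of ||sum_i X_i||_2 around its mean for bounded independent vectors.
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 200, 10, 2000
norms = np.array([np.linalg.norm(rng.uniform(-1, 1, size=(n, d)).sum(axis=0))
                  for _ in range(trials)])
t = 10.0
emp_tail = np.mean(norms >= norms.mean() + t)
bd_bound = np.exp(-2 * t**2 / (n * (2 * np.sqrt(d))**2))   # c_i = 2 sqrt(d) on [-1,1]^d
print(f"E||sum X_i||_2 ~ {norms.mean():.1f} <= sqrt(sum E||X_i||^2) = {np.sqrt(n * d / 3):.1f}")
print(f"P(norm >= mean + {t}): empirical {emp_tail:.3f}, bounded-differences bound {bd_bound:.3f}")
```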
Theorem 6.1.9. Let X1 , . . . , Xn be independent random variables with Xi ∈ [a, b] for all i. Assume
that f : Rn → R is separately convex and L-Lipschitz with respect to the k·k2 norm. Then
We defer the proof of the theorem temporarily, giving two example applications. The first is to
the matrix concentration problem that motivates the beginning of this section.
Example 6.1.10: Let X ∈ Rm×n be a matrix with independent entries, where Xij ∈ [−1, 1]
for all i, j, and let |||·||| denote the operator norm on matrices, that is, |||A||| = supu,v {u> Av :
kuk2 ≤ 1, kvk2 ≤ 1}. Then Theorem 6.1.9 implies
\[
P\big(|||X||| \ge E[|||X|||] + t\big) \le \exp\Big(-\frac{t^2}{16}\Big).
\]
where k·kFr denotes the Frobenius norm of a matrix. Thus the matrix operator norm is 1-
Lipschitz. Therefore, we have by Theorem 6.1.9 and the Chernoff bound technique that
As a second example, we consider Rademacher complexity. These types of results are important
for giving generalization bounds in a variety of statistical algorithms, and form the basis of a variety
of concentration and convergence results. We defer further motivation of these ideas to subsequent
chapters, just mentioning here that we can provide strong concentration guarantees for Rademacher
complexity or Rademacher chaos.
Example 6.1.11: Let A ⊂ R^n be any collection of vectors. Then the Rademacher complexity
of the class A is
\[
R_n(A) := E\Big[\sup_{a\in A}\sum_{i=1}^n a_i\varepsilon_i\Big], \qquad (6.1.9)
\]
\[
P\big(\widehat{R}_n(A) \ge R_n(A) + t\big) \le \exp\Big(-\frac{t^2}{16\,\mathrm{diam}(A)^2}\Big),
\]
where diam(A) := supa∈A kak2 . Indeed, we have that ε 7→ supa∈A a> ε is a convex function,
as it is the maximum of a family of linear functions. Moreover, it is Lipschitz, with Lipschitz
constant bounded by supa∈A kak2 . Applying Theorem 6.1.9 as in Example 6.1.10 gives the
result. 3
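The following Monte Carlo sketch illustrates this concentration for a small finite class A; the class itself is an illustrative assumption.

```python
# Empirical Rademacher complexity sup_{a in A} <a, eps> and its fluctuations,
# which Theorem 6.1.9 controls at the scale diam(A) = sup_{a in A} ||a||_2.
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 100, 50, 5000
A = rng.normal(size=(m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)       # diam(A) = 1
eps = rng.choice([-1.0, 1.0], size=(trials, n))      # Rademacher signs
sups = (eps @ A.T).max(axis=1)                       # sup_{a in A} sum_i a_i eps_i per draw
print(f"R_n(A) ~ {sups.mean():.2f}, empirical std {sups.std():.2f} (scale diam(A) = 1)")
```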
Proof of Theorem 6.1.9 The proof relies on our earlier tensorization identity and a sym-
metrization lemma.
iid
Lemma 6.1.12. Let X, Y ∼ P be independent. Then for any function g : R → R, we have
Proof For the first result, we use the convexity of the exponential in an essential way. In
particular, we have
because log is concave and ex ≥ 0. Using symmetry, that is, that g(X) − g(Y ) has the same
distribution as g(Y ) − g(X), we then find
\[
H(e^{\lambda g(X)}) \le \frac{1}{2} E\big[\lambda(g(X)-g(Y))(e^{\lambda g(X)} - e^{\lambda g(Y)})\big]
= E\big[\lambda(g(X)-g(Y))(e^{\lambda g(X)} - e^{\lambda g(Y)})\mathbf{1}\{g(X)\ge g(Y)\}\big].
\]
Now we use the classical first order convexity inequality—that a convex function f satisfies f (t) ≥
f (s)+f 0 (s)(t−s) for all t and s, Theorem B.3.3 in the appendices—which gives that et ≥ es +es (t−s)
for all s and t. Rewriting, we have es − et ≤ es (s − t), and whenever s ≥ t, we have (s − t)(es − et ) ≤
es (s − t)2 . Replacing s and t with λg(X) and λg(Y ), respectively, we obtain
λ(g(X) − g(Y ))(eλg(X) − eλg(Y ) )1 {g(X) ≥ g(Y )} ≤ λ2 (g(X) − g(Y ))2 eλg(X) 1 {g(X) ≥ g(Y )} .
Returning to the main thread of the proof, we note that the separate convexity of f and the
tensorization identity of Theorem 6.1.6 imply
\[
H(e^{\lambda f(X_{1:n})}) \le E\bigg[\sum_{i=1}^n H\big(e^{\lambda f(X_{1:n})} \mid X_{\setminus i}\big)\bigg]
\le \sum_{i=1}^n E\bigg[\lambda^2 E\Big[(X_i - Y_i)^2\Big(\frac{\partial}{\partial x_i} f(X_{1:n})\Big)^2 e^{\lambda f(X_{1:n})} \,\Big|\, X_{\setminus i}\Big]\bigg],
\]
where Yi are independent copies of the Xi . Now, we use that (Xi −Yi )2 ≤ (b−a)2 and the definition
of the partial derivative to obtain
Noting that k∇f (X)k22 ≤ L2 , and applying the Herbst argument, gives the result.
so that we “project out” the jth coordinate, and define the projected sets.
Chapter 7
In this chapter, we continue to build on our ideas on stability in different scenarios, ranging from
model fitting and concentration to interactive data analyses. Here, we show how stability ideas
allow us to provide a new type of protection: the privacy of participants in studies. Until the mid-
2000s, the major challenge in this direction had been a satisfactory definition of privacy, because
collection of side information often results in unforeseen compromises of private information. The
introduction of differential privacy—a type of stability in likelihood ratios for data releases from
differing samples—alleviated these challenges, providing a firm foundation on which to build private
estimators and other methodology. (Though it is possible to trace some of the definitions and major
insights in privacy back at least to survey sampling literature in the 1960s.) Consequently, in this
chapter we focus on privacy notions based on differential privacy and its cousins, developing the
information-theoretic stability ideas helpful to understand the protections it is possible to provide.
perform a study on smoking, and discover that smoking causes cancer. We publish the result, but
now we have “compromised” the privacy of everyone who smokes who did not participate in the
study: we know they are more likely to get cancer.
In each of these cases, the biggest challenge is one of side information: how can we be sure
that, when releasing a particular statistic, dataset, or other quantity that no adversary will be able
to infer sensitive data about participants in our study? We articulate three desiderata that—we
believe—suffice for satisfactory definitions of privacy. In discussion of private releases of data, we
require a bit of vocabulary. We term a (randomized) algorithm releasing data either a privacy
mechanism, consistent with much of the literature in privacy, or a channel, mapping from the input
sample to some output space, in keeping with our statistical and information-theoretic focus. In
no particular order, we wish our privacy mechanism, which takes as input a sample X1n ∈ X n and
releases some Z to satisfy the following.
i. Given the output Z, even an adversary knowing everyone in the study (excepting one person)
should not be able to test whether you belong to the study.
ii. If you participate in multiple “private” studies, there should be some graceful degradation
in the privacy protections, rather than a catastrophic failure. As part of this, any definition
should guarantee that further processing of the output Z of a private mechanism X1n → Z, in
the form of the Markov chain X1n → Z → Y , should not allow further compromise of privacy
(that is, a data-processing inequality). Additional participation in “private” studies should
continue to provide little additional information.
iii. The mechanism X1n → Z should be resilient to side information: even if someone knows
something about you, he should learn little about you if you belong to X1n , and this should
remain true even if the adversary later gleans more information about you.
The third desideratum is perhaps most elegantly phrased via a Bayesian perspective, where an
adversary has some prior beliefs π on the membership of a dataset (these prior beliefs can then
capture any side information the adversary has). The strongest adversary has a prior supported on
two samples {x1 , . . . , xn } and {x01 , . . . , x0n } differing in only a single element; a private mechanism
would then guarantee the adversary’s posterior beliefs (after the release X1n → Z) should not change
significantly.
Before continuing addressing these challenges, we take a brief detour to establish notation for the
remainder of the chapter. It will be convenient to consider randomized procedures acting on samples
themselves; a sample x_1^n is clearly isomorphic to the empirical distribution P_n = \frac{1}{n}\sum_{i=1}^n 1_{x_i}, and
for two empirical distributions P_n and P_n' supported on {x_1, . . . , x_n} and {x_1', . . . , x_n'}, we evidently
have
\[
n\,\|P_n - P_n'\|_{\rm TV} = d_{\rm ham}\big(\{x_1,\ldots,x_n\}, \{x_1',\ldots,x_n'\}\big),
\]
and so we will identify samples with their empirical distributions. With this notational convenience
in place, we then identify
\[
\mathcal{P}_n = \Big\{ P_n = \frac{1}{n}\sum_{i=1}^n 1_{x_i} \;\Big|\; x_i \in \mathcal{X} \Big\}
\]
as the set of all empirical distributions on n points in X and we also abuse notation in an obvious
way to define dham (Pn , Pn0 ) := n kPn − Pn0 kTV as the number of differing observations in the samples
Pn and Pn0 represent. A mechanism M is then a (typically) randomized mapping M : Pn → Z,
which we can identify with its induced Markov channel Q from X n → Z; we use the equivalent
views as is convenient.
The challenges of side information motivate Dwork et al.’s definition of differential privacy [74].
The key in differential privacy is that the noisy channel releasing statistics provides guarantees of
bounded likelihood ratios between neighboring samples, that is, samples differing in only a single
entry.
Definition 7.1 (Differential privacy). Let M : Pn → Z be a randomized mapping. Then M is
ε-differentially private if for all (measurable) sets S ⊂ Z and all Pn , Pn0 ∈ Pn with dham (Pn , Pn0 ) ≤ 1,
\[
\frac{P(M(P_n) \in S)}{P(M(P_n') \in S)} \le e^{\varepsilon}. \qquad (7.1.1)
\]
The intuition and original motivation for this definition are that an individual has little incentive
to participate (or not participate) in a study, as the individual’s data has limited effect on the
outcome.
The model (7.1.1) of differential privacy presumes that there is a trusted curator, such as a
hospital, researcher, or corporation, who can collect all the data into one centralized location, and
it is consequently known as the centralized model. A stronger model of privacy is the local model,
in which data providers trust no one, not even the data collector, and privatize their individual
data before the collector even sees it.
Definition 7.2 (Local differential privacy). A channel Q from X to Z is ε-locally differentially
private if for all measurable S ⊂ Z and all x, x0 ∈ X ,
\[
\frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le e^{\varepsilon}. \qquad (7.1.2)
\]
It is clear that Definition 7.2 and the condition (7.1.2) are stronger than Definition 7.1: when
samples {x1 , . . . , xn } and {x01 , . . . , x0n } differ in at most one observation, then the local model (7.1.2)
guarantees that the densities
\[
\frac{dQ(Z_1^n \mid \{x_i\})}{dQ(Z_1^n \mid \{x_i'\})} = \prod_{i=1}^n \frac{dQ(Z_i \mid x_i)}{dQ(Z_i \mid x_i')} \le e^{\varepsilon},
\]
where the inequality follows because only a single ratio may contain x_i ≠ x_i'.
In the remainder of this introductory section, we provide a few of the basic mechanisms in use
in differential privacy, then discuss its “semantics,” that is, its connections to the three desiderata
we outline above. In the coming sections, we revisit a few more advanced topics, in particular, the
composition of multiple private mechanisms and a few weakenings of differential privacy, as well as
more sophisticated examples.
estimate the proportion of the population with a characteristic (versus those without); call
these groups 0 and 1. Rather than ask the participant to answer the question specifically,
however, we give them a spinner with a face painted in two known areas, where the first
corresponds to group 0 and has area eε /(1 + eε ) and the second to group 1 and has area
1/(1 + eε ). Thus, when the participant spins the spinner, it lands in group 0 with probability
eε /(1 + eε ). Then we simply ask the participant, upon spinning the spinner, to answer “Yes”
if he or she belongs to the indicated group, “No” otherwise.
Let us demonstrate that this randomized response mechanism provides ε-local differential
privacy. Indeed, we have
\[
\frac{Q(\text{Yes} \mid x = 0)}{Q(\text{Yes} \mid x = 1)} = e^{\varepsilon}
\quad\text{and}\quad
\frac{Q(\text{No} \mid x = 0)}{Q(\text{No} \mid x = 1)} = e^{-\varepsilon},
\]
so that Q(Z = z | x)/Q(Z = z | x0 ) ∈ [e−ε , eε ] for all x, z. That is, the randomized response
channel provides ε-local privacy. 3
The interesting question is, of course, whether we can still use this channel to estimate the
proportion of the population with the sensitive characteristic. Indeed, we can. We can provide
a somewhat more general analysis, however, which we now do so that we can give a complete
example.
Example 7.1.2 (Randomized response, continued): Suppose that we have an attribute of
interest, x, taking the values x ∈ {1, . . . , k}. Then we consider the channel (of Z drawn
conditional on x)
\[
Z = \begin{cases}
x & \text{with probability } \dfrac{e^{\varepsilon}}{k - 1 + e^{\varepsilon}} \\[4pt]
\text{Uniform}([k]\setminus\{x\}) & \text{with probability } \dfrac{k - 1}{k - 1 + e^{\varepsilon}}.
\end{cases}
\]
satisfies E[\hat{p}_n] = p, and we also have
\[
E\big[\|\hat{p}_n - p\|_2^2\big]
= \Big(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\Big)^2 E\big[\|\hat{c}_n - E[\hat{c}_n]\|_2^2\big]
= \Big(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\Big)^2 \frac{1}{n}\sum_{j=1}^k P(Z = j)\big(1 - P(Z = j)\big).
\]
As \sum_j P(Z = j) = 1, we always have the bound E[\|\hat{p}_n - p\|_2^2] \le \frac{1}{n}\big(\frac{e^{\varepsilon} + k - 1}{e^{\varepsilon} - 1}\big)^2.
We may consider two regimes for simplicity: when ε ≤ 1 and when ε ≥ log k. In the former
case—the high privacy regime—we have P(Z = i) ≍ 1/k, so that the mean ℓ₂ squared error
scales as \frac{1}{n}\frac{k^2}{\varepsilon^2}. When ε ≥ log k is large, by contrast, we see that the error scales at worst as \frac{1}{n},
which is the "non-private" mean squared error. ◇
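A minimal sketch of this channel follows. The debiasing step inverts the identity P(Z = j) = [1 + p_j(e^ε − 1)]/(k − 1 + e^ε); the specific estimator written here is the natural unbiased inversion, consistent with the variance identity above (the text's own definition of \hat{p}_n falls in a passage not shown here), and the data are illustrative.

```python
# k-ary randomized response with debiased frequency estimation.
import numpy as np

def randomize(x, k, eps, rng):
    """Keep x w.p. e^eps/(k-1+e^eps); otherwise report a uniform element of [k]\{x}."""
    if rng.random() < np.exp(eps) / (k - 1 + np.exp(eps)):
        return x
    return rng.choice([j for j in range(k) if j != x])

def estimate(Z, k, eps):
    c_hat = np.bincount(Z, minlength=k) / len(Z)          # empirical frequencies of Z
    return ((np.exp(eps) + k - 1) * c_hat - 1) / (np.exp(eps) - 1)

rng = np.random.default_rng(0)
k, n, eps = 10, 20000, 1.0
p = rng.dirichlet(np.ones(k))
X = rng.choice(k, size=n, p=p)
Z = np.array([randomize(x, k, eps, rng) for x in X])
p_hat = estimate(Z, k, eps)
bound = np.sqrt(1 / n) * (np.exp(eps) + k - 1) / (np.exp(eps) - 1)
print(f"l2 error {np.linalg.norm(p_hat - p):.3f}, root-mean-square bound {bound:.3f}")
```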
While randomized response is essentially the standard mechanism in locally private settings, in
centralized privacy, the “standard” mechanism is Laplace noise addition because of its exponential
tails. In this case, we require a few additional definitions. Suppose that we wish to release some
d-dimensional function f (Pn ) of the sample distribution Pn (equivalently, the associated sample
X1n ), where f takes values in Rd . In the case that f is Lipschitz with respect to the Hamming
metric—that is, the counting metric on X n —it is relatively straightforward to develop private
mechanisms. To better reflect the nomenclature in the privacy literature and easier use in our
future development, for p ∈ [1, ∞] we define the global sensitivity of f by
n o
GSp (f ) := sup f (Pn ) − f (Pn0 ) p | dham (Pn , Pn0 ) ≤ 1 .
Pn ,Pn0 ∈Pn
This is simply the Lipschitz constant of f with respect to the Hamming metric. The global sensi-
tivity is a convenient metric, because it allows simple noise addition strategies.
Letting L = GS1 (f ) be the Lipschitz constant for simplicity, if we consider the mechanism
defined by the addition of W ∈ Rd with independent Laplace(L/ε) coordinates,
\[
Z := f(P_n) + W, \qquad W_j \stackrel{\rm iid}{\sim} \mathrm{Laplace}(L/\varepsilon), \qquad (7.1.3)
\]
we have that Z is ε-differentially private. Indeed, for samples P_n, P_n' differing in at most a
single example, Z has density ratio
\[
\frac{q(z \mid P_n)}{q(z \mid P_n')}
= \exp\Big(-\frac{\varepsilon}{L}\|f(P_n) - z\|_1 + \frac{\varepsilon}{L}\|f(P_n') - z\|_1\Big)
\le \exp\Big(\frac{\varepsilon}{L}\|f(P_n) - f(P_n')\|_1\Big) \le \exp(\varepsilon)
\]
by the triangle inequality and that f is L-Lipschitz with respect to the Hamming metric. Thus
Z is ε-differentially private. Moreover, we have
\[
E[\|Z - f(P_n)\|_2^2] = \frac{2d\,\mathrm{GS}_1(f)^2}{\varepsilon^2},
\]
so that if L is small, we may report the value of f accurately. 3
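A minimal sketch of the Laplace mechanism (7.1.3) follows; the bounded data and the mean statistic are illustrative assumptions, anticipating Example 7.1.4 below.

```python
# Laplace mechanism: add independent Laplace(GS_1(f)/eps) noise to each coordinate.
import numpy as np

def laplace_mechanism(value, gs1, eps, rng):
    """Release value + W with W_j ~ Laplace(gs1 / eps) i.i.d."""
    value = np.atleast_1d(np.asarray(value, dtype=float))
    return value + rng.laplace(scale=gs1 / eps, size=value.shape)

rng = np.random.default_rng(0)
b, n, eps = 1.0, 10000, 0.5
X = rng.uniform(-b, b, size=n)
# the sample mean has global l1-sensitivity 2b/n
Z = laplace_mechanism(X.mean(), gs1=2 * b / n, eps=eps, rng=rng)
print(f"private mean {Z[0]:.4f}, sample mean {X.mean():.4f}")
```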
The most common instances and applications of the Laplace mechanism are in estimation of
means and histograms. Let us demonstrate more carefully worked examples in these two cases.
Example 7.1.4 (Private one-dimensional mean estimation): Suppose that we have variables
X_i taking values in [−b, b] for some b < ∞, and wish to estimate E[X]. A natural function to
release is then f(X_1^n) = \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i. This has Lipschitz constant 2b/n with respect to
the Hamming metric, because for any two samples x, x' ∈ [−b, b]^n differing in only entry i, we
have
\[
|f(x) - f(x')| = \frac{1}{n}|x_i - x_i'| \le \frac{2b}{n}
\]
because x_i ∈ [−b, b]. Thus the Laplace mechanism (7.1.3) with the choice W ∼ Laplace(2b/(nε)) yields
\[
E[(Z - E[X])^2] = E[(\bar{X}_n - E[X])^2] + E[(Z - \bar{X}_n)^2]
= \frac{1}{n}\mathrm{Var}(X) + \frac{8b^2}{n^2\varepsilon^2} \le \frac{b^2}{n} + \frac{8b^2}{n^2\varepsilon^2}.
\]
We can privately release means with little penalty so long as ε ≳ n^{−1/2}. ◇
Example 7.1.5 (Private histogram (multinomial) release): Suppose that we wish to estimate
a multinomial distribution, or put differently, a histogram. That is, we have observations
X ∈ {1, . . . , k}, where k may be large, and wish to estimate p_j := P(X = j) for j = 1, . . . , k.
For a given sample x_1^n, the empirical count vector \hat{p}_n with coordinates \hat{p}_{n,j} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i = j\}
satisfies
\[
\mathrm{GS}_1(\hat{p}_n) = \frac{2}{n},
\]
because swapping a single example x_i for x_i' may change the counts for at most two coordinates
j, j' by 1. Consequently, the Laplace noise addition mechanism
\[
Z = \hat{p}_n + W, \qquad W_j \stackrel{\rm iid}{\sim} \mathrm{Laplace}\Big(\frac{2}{n\varepsilon}\Big)
\]
satisfies
satisfies
\[
E[\|Z - \hat{p}_n\|_2^2] = \frac{8k}{n^2\varepsilon^2}
\]
and consequently
\[
E[\|Z - p\|_2^2] = \frac{8k}{n^2\varepsilon^2} + \frac{1}{n}\sum_{j=1}^k p_j(1 - p_j) \le \frac{8k}{n^2\varepsilon^2} + \frac{1}{n}.
\]
This example shows one of the challenges of differentially private mechanisms: even in the case
where the quantity of interest is quite stable (insensitive to changes in the underlying sample,
or has small Lipschitz constant), it may be the case that the resulting mechanism adds noise
that introduces some dimension-dependent scaling. In this case, the condition on the privacy
level acceptable for good estimation—in that the rate of convergence is no different from the
non-private case, which achieves E[\|\hat{p}_n - p\|_2^2] = \frac{1}{n}\sum_{j=1}^k p_j(1 - p_j) \le \frac{1}{n}—is that ε ≳ \sqrt{k/n}. Thus,
in the case that the histogram has a large number of bins, the naive noise addition strategy
cannot provide as much protection without sacrificing efficiency.
If instead of `2 -error we consider `∞ error, it is possible to provide somewhat more satisfying
results in this case. Indeed, we know that P(\|W\|_\infty \ge t) \le k\exp(-t/b) for W_j \stackrel{\rm iid}{\sim} \mathrm{Laplace}(b),
so that in the mechanism above we have
\[
P(\|Z - \hat{p}_n\|_\infty \ge t) \le k\exp\Big(-\frac{tn\varepsilon}{2}\Big) \quad\text{for all } t \ge 0,
\]
so that the samples under H0 and H1 differ only in the ith observation Xi ∈ {xi , x0i }. Now, for a
channel taking inputs from X n and outputting Z ∈ Z, we define ε-conditional hypothesis testing
privacy by saying that
Q(Ψ(Z) = 1 | H0 , Z ∈ A) + Q(Ψ(Z) = 0 | H1 , Z ∈ A) ≥ 1 − ε (7.1.4)
for all sets A ⊂ Z satisfying Q(A | H0 ) > 0 and Q(A | H1 ) > 0. That is, roughly, no matter
what value Z takes on, the probability of error in a test of whether H0 or H1 is true—even with
knowledge of xj , j 6= i—is high. We then have the following proposition.
Proposition 7.1.6. Assume the channel Q is ε-differentially private. Then Q is also ε̄ = 1−e−2ε ≤
2ε-conditional hypothesis testing private.
Proof Let Ψ be any test of H0 versus H1 , and let B = {z | Ψ(z) = 1} be the acceptance region
of the test. Then
\[
\begin{aligned}
Q(B \mid H_0, Z \in A) + Q(B^c \mid H_1, Z \in A)
&= \frac{Q(A, B \mid H_0)}{Q(A \mid H_0)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon}\frac{Q(A, B \mid H_1)}{Q(A \mid H_1)} + \frac{Q(A, B^c \mid H_1)}{Q(A \mid H_1)} \\
&\ge e^{-2\varepsilon}\frac{Q(A, B \mid H_1) + Q(A, B^c \mid H_1)}{Q(A \mid H_1)},
\end{aligned}
\]
where the first inequality uses ε-differential privacy. Then we simply note that Q(A, B | H1 ) +
Q(A, B c | H1 ) = Q(A | H1 ).
So we see that (roughly), even conditional on the output of the channel, we still cannot test whether
the initial dataset was x or x0 whenever x, x0 differ in only a single observation.
An alternative perspective is to consider a Bayesian one, which allows us to more carefully
consider side information. In this case, we consider the following thought experiment. An adversary
has a set of prior beliefs π on X n , and we consider the adversary’s posterior π(· | Z) induced by
observing the output Z of some mechanism M . In this case, Bayes factors, which measure how
much prior and posterior distributions differ after observations, provide one immediate perspective.
π(Pn | z)
≤ eε
π(Pn0 | z)
Proof Let q be the associated density of Z = M (·) (conditional or marginal). We have π(Pn |
z) = q(z | Pn )π(Pn )/q(z). Then
Thus we see that private channels mean that prior and posterior odds between two neighboring
samples cannot change substantially, no matter what the observation Z actually is.
For an alternative view, we consider a somewhat restricted family of prior distributions,
where we now take the view of a sample x_1^n ∈ X^n. There is some annoyance in this calculation
in that the order of the sample may be important, but it at least gets toward some semantic
interpretation of differential privacy. We consider the adversary’s beliefs on whether a particular
value x belongs to the sample, but more precisely, we consider whether Xi = x. We assume that
the prior density π on X n satisfies
are independent of his beliefs about the other members of the dataset. (We assume that π is
a density with respect to a measure µ on X n−1 × X , where dµ(s, x) = dµ(s)dµ(x).) Under the
condition (7.1.5), we have the following proposition.
Proposition 7.1.8. Let Q be an ε-differentially private channel and let π be any prior distribution
satisfying condition (7.1.5). Then for any z, the posterior density πi on Xi satisfies
Proof   We abuse notation and, for a sample s ∈ X^{n−1} with s = (x_1^{i−1}, x_{i+1}^n), we let s ⊕_i x =
(x_1^{i−1}, x, x_{i+1}^n). Letting µ be the base measure on X^{n−1} × X with respect to which π is a density
and q(· | x_1^n) be the density of the channel Q, we have
\[
\begin{aligned}
\pi_i(x \mid Z = z)
&= \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x)\,\pi(s \oplus_i x)\,d\mu(s)}
       {\int_{s \in \mathcal{X}^{n-1}}\int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x')\,\pi(s \oplus_i x')\,d\mu(s, x')} \\
&\stackrel{(\star)}{\le} e^{\varepsilon}\,
  \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x)\,\pi(s \oplus_i x)\,d\mu(s)}
       {\int_{s \in \mathcal{X}^{n-1}}\int_{x' \in \mathcal{X}} q(z \mid s \oplus_i x)\,\pi(s \oplus_i x')\,d\mu(s)\,d\mu(x')} \\
&= e^{\varepsilon}\,
  \frac{\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x)\,\pi_{\setminus i}(s)\,d\mu(s)\,\pi_i(x)}
       {\int_{s \in \mathcal{X}^{n-1}} q(z \mid s \oplus_i x)\,\pi_{\setminus i}(s)\,d\mu(s)\int_{x' \in \mathcal{X}}\pi_i(x')\,d\mu(x')}
 = e^{\varepsilon}\pi_i(x),
\end{aligned}
\]
where inequality (⋆) follows from ε-differential privacy. The lower bound is similar.
Roughly, however, we see that Proposition 7.1.8 captures the idea that even if an adversary has
substantial prior knowledge—in the form of a prior distribution π on the ith value Xi and everything
else in the sample—the posterior cannot change much.
Definition 7.4. Let P and Q be distributions on a space X with densities p and q (with respect to
a measure µ). For α ∈ [1, ∞], the Rényi-α-divergence between P and Q is
\[
D_\alpha(P\|Q) := \frac{1}{\alpha - 1}\log\int\Big(\frac{p(x)}{q(x)}\Big)^{\alpha} q(x)\,d\mu(x).
\]
Here, the values α ∈ {1, ∞} are defined in terms of their respective limits.
Rényi divergences satisfy exp((α − 1)D_α(P‖Q)) = 1 + D_f(P‖Q), i.e., D_α(P‖Q) = \frac{1}{\alpha - 1}\log(1 + D_f(P\|Q)), for the f-divergence defined by f(t) = t^α − 1, so that they inherit a number of the
properties of such divergences. We enumerate a few here for later reference.
Proposition 7.2.1 (Basic facts on Rényi divergence). Rényi divergences satisfy the following.
ii. limα↓1 Dα (P ||Q) = Dkl (P ||Q) and limα↑∞ Dα (P ||Q) = sup{t | Q(p(X)/q(X) ≥ t) > 0}.
iii. Let K(· | x) be a Markov kernel from X → Z as in Proposition 2.2.13, and let KP and KQ be
the induced marginals of P and Q under K, respectively. Then Dα (KP ||KQ ) ≤ Dα (P ||Q).
We leave the proof of this proposition as Exercise 7.1, noting that property i is a consequence
of Hölder’s inequality, property ii is by L’Hopital’s rule, and property iii is an immediate conse-
quence of Proposition 2.2.13. Rényi divergences also tensorize nicely—generalizing the tensoriza-
tion properties of KL-divergence and information of Chapter 2 (recall the chain rule (2.1.6) for
KL-divergence)—and we return to this later. As a preview, however, these tensorization proper-
ties allow us to prove that the composition of multiple private data releases remains appropriately
private.
With these preliminaries in place, we can then provide
Definition 7.5 (Rényi-differential privacy). Let ε ≥ 0 and α ∈ [1, ∞]. A channel Q from P_n to
output space Z is (ε, α)-Rényi private if for all neighboring samples P_n, P_n' ∈ P_n,
\[
D_\alpha\big(Q(\cdot \mid P_n)\,\|\,Q(\cdot \mid P_n')\big) \le \varepsilon.
\]
Clearly, any ε-differentially private channel is also (ε, α)-Rényi private for any α ≥ 1; as we soon
see, we can provide tighter guarantees than this.
Example 7.2.2 (Rényi divergence between Gaussian distributions): Consider normal distri-
butions N(µ0 , Σ) and N(µ1 , Σ). Then
\[
D_\alpha\big(\mathsf{N}(\mu_0, \Sigma)\,\|\,\mathsf{N}(\mu_1, \Sigma)\big) = \frac{\alpha}{2}(\mu_0 - \mu_1)^T\Sigma^{-1}(\mu_0 - \mu_1). \qquad (7.2.3)
\]
To see this equality, we compute the appropriate integral of the densities. Let p and q be the
densities of N(µ_0, Σ) and N(µ_1, Σ), respectively. Then letting E_{µ_1} denote expectation over
X ∼ N(µ_1, Σ), we have
\[
\begin{aligned}
\int\Big(\frac{p(x)}{q(x)}\Big)^{\alpha} q(x)\,dx
&= E_{\mu_1}\Big[\exp\Big(-\frac{\alpha}{2}(X - \mu_0)^T\Sigma^{-1}(X - \mu_0) + \frac{\alpha}{2}(X - \mu_1)^T\Sigma^{-1}(X - \mu_1)\Big)\Big] \\
&\stackrel{(i)}{=} E_{\mu_1}\Big[\exp\Big(-\frac{\alpha}{2}(\mu_0 - \mu_1)^T\Sigma^{-1}(\mu_0 - \mu_1) + \alpha(\mu_0 - \mu_1)^T\Sigma^{-1}(X - \mu_1)\Big)\Big] \\
&\stackrel{(ii)}{=} \exp\Big(-\frac{\alpha}{2}(\mu_0 - \mu_1)^T\Sigma^{-1}(\mu_0 - \mu_1) + \frac{\alpha^2}{2}(\mu_0 - \mu_1)^T\Sigma^{-1}(\mu_0 - \mu_1)\Big),
\end{aligned}
\]
where equality (i) is simply using that (x − a)2 − (x − b)2 = (a − b)2 + 2(b − a)(x − b) and
equality (ii) follows because (µ0 − µ1 )T Σ−1 (X − µ1 ) ∼ N(0, (µ1 − µ0 )T Σ−1 (µ1 − µ0 )) under
X ∼ N(µ1 , Σ). Noting that −α + α2 = α(α − 1) and taking logarithms gives the result. 3
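The closed form (7.2.3) is easy to check numerically in one dimension; the sketch below estimates the defining integral by Monte Carlo under illustrative parameter values.

```python
# Monte Carlo check of D_alpha(N(mu0, s^2) || N(mu1, s^2)) = alpha (mu0-mu1)^2 / (2 s^2).
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, sigma, alpha, m = 0.3, -0.2, 1.5, 4.0, 2_000_000
X = rng.normal(mu1, sigma, size=m)                                   # X ~ q = N(mu1, sigma^2)
log_ratio = (-(X - mu0) ** 2 + (X - mu1) ** 2) / (2 * sigma ** 2)    # log p(x)/q(x)
mc = np.log(np.mean(np.exp(alpha * log_ratio))) / (alpha - 1)
closed = alpha * (mu0 - mu1) ** 2 / (2 * sigma ** 2)
print(f"Monte Carlo {mc:.4f}  vs  closed form {closed:.4f}")
```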
Example 7.2.2 is the key to developing different privacy-preserving schemes under Rényi privacy.
Let us reconsider Example 7.1.3, except that instead of assuming the function f of interest is smooth
with respect to `1 norm, we use the `2 -norm.
Example 7.2.3 (Gaussian mechanisms): Suppose that f : P_n → R^d has Lipschitz constant
L with respect to the ℓ₂-norm (for the Hamming metric d_ham), that is, global ℓ₂-sensitivity
GS₂(f) ≤ L. Then the mechanism
\[
Z = f(P_n) + W, \qquad W \sim \mathsf{N}(0, \sigma^2 I),
\]
satisfies
\[
D_\alpha\big(\mathsf{N}(f(P_n), \sigma^2 I)\,\|\,\mathsf{N}(f(P_n'), \sigma^2 I)\big)
= \frac{\alpha}{2\sigma^2}\|f(P_n) - f(P_n')\|_2^2 \le \frac{\alpha}{2\sigma^2}L^2
\]
for neighboring samples P_n, P_n'. Thus, if we have Lipschitz constant L and desire (ε, α)-Rényi
privacy, we may take σ² = \frac{L^2\alpha}{2\varepsilon}, and then the mechanism
\[
Z = f(P_n) + W, \qquad W \sim \mathsf{N}\Big(0, \frac{L^2\alpha}{2\varepsilon} I\Big), \qquad (7.2.4)
\]
satisfies (ε, α)-Rényi privacy. ◇
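A minimal sketch of the Gaussian mechanism (7.2.4) follows; the norm-bounded data mirror Example 7.2.4 below, and the specific parameter values are illustrative assumptions.

```python
# Gaussian mechanism for (eps, alpha)-Renyi privacy: sigma^2 = L^2 alpha / (2 eps).
import numpy as np

def gaussian_mechanism(value, l2_sens, eps, alpha, rng):
    """Release value + N(0, sigma^2 I) with sigma^2 = l2_sens^2 * alpha / (2 eps)."""
    sigma = np.sqrt(l2_sens ** 2 * alpha / (2 * eps))
    return value + rng.normal(scale=sigma, size=np.shape(value))

rng = np.random.default_rng(0)
n, d, r, eps, alpha = 5000, 20, 1.0, 0.5, 8.0
X = rng.normal(size=(n, d))
X *= np.minimum(1.0, r / np.linalg.norm(X, axis=1))[:, None]   # enforce ||x_i||_2 <= r
Z = gaussian_mechanism(X.mean(axis=0), l2_sens=2 * r / n, eps=eps, alpha=alpha, rng=rng)
print(f"l2 error of private mean: {np.linalg.norm(Z - X.mean(axis=0)):.4f}")
```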
Certain special cases can make this more concrete. Indeed, suppose we wish to estimate a mean
iid
E[X] where Xi ∼ P for some distribution P such that kXi k2 ≤ r with probability 1 for some
radius.
Example 7.2.4 (Bounded mean estimation with Gaussian mechanisms): Letting f(X_1^n) =
\bar{X}_n be the sample mean, where the X_i satisfy ‖X_i‖₂ ≤ r as above, we see immediately that
\[
\mathrm{GS}_2(f) = \frac{2r}{n}.
\]
In this case, the Gaussian mechanism (7.2.4) with L = \frac{2r}{n} yields
\[
E\big[\|Z - \bar{X}_n\|_2^2\big] = E[\|W\|_2^2] = \frac{2dr^2\alpha}{n^2\varepsilon}.
\]
Then we have
\[
E[\|Z - E[X]\|_2^2] = E\big[\|\bar{X}_n - E[X]\|_2^2\big] + E\big[\|Z - \bar{X}_n\|_2^2\big] \le \frac{r^2}{n} + \frac{2dr^2\alpha}{n^2\varepsilon}.
\]
It is not immediately apparent how to compare this quantity to the case for the Laplace mech-
anism in Example 7.1.3, but we will return to this shortly once we have developed connections
between the various privacy notions we have developed. 3
Proposition 7.2.5. Let ε ≥ 0 and let P and Q be distributions such that e−ε ≤ P (A)/Q(A) ≤ eε
for all measurable sets A. Then for any α ∈ [1, ∞],
\[
D_\alpha(P\|Q) \le \min\Big\{\frac{3\alpha}{2}\varepsilon^2, \varepsilon\Big\}.
\]
Corollary 7.2.6. Let ε ≥ 0 and assume that Q is ε-differentially private. Then for any α ≥ 1, Q
is (min{\frac{3\alpha}{2}\varepsilon^2, \varepsilon}, α)-Rényi private.
Before proving the proposition, let us see its implications for Example 7.2.4 versus estimation
under ε-differential privacy. Let ε ≤ 1, so that roughly to have “similar” privacy, we require
that our Rényi private channels satisfy D_α(Q(· | x)‖Q(· | x')) ≤ ε². The ℓ₁-sensitivity of the mean
satisfies ‖\bar{x}_n − \bar{x}_n'‖₁ ≤ \sqrt{d}\,‖\bar{x}_n − \bar{x}_n'‖₂ ≤ 2\sqrt{d}\,r/n for neighboring samples. Then the Laplace
mechanism (7.1.3) satisfies
\[
E[\|Z_{\rm Laplace} - E[X]\|_2^2] = E\big[\|\bar{X}_n - E[X]\|_2^2\big] + \frac{8r^2}{n^2\varepsilon^2}\cdot d^2,
\]
while the Gaussian mechanism under (ε², α)-Rényi privacy will yield
\[
E[\|Z_{\rm Gauss} - E[X]\|_2^2] = E\big[\|\bar{X}_n - E[X]\|_2^2\big] + \frac{2r^2}{n^2\varepsilon^2}\cdot d\alpha.
\]
This is evidently better than the Laplace mechanism whenever α < d.
Proof of Proposition 7.2.5   We assume that P and Q have densities p and q with respect to a
base measure µ, which is no loss of generality, whence the ratio condition implies that e^{−ε} ≤ p/q ≤ e^{ε}
and D_α(P‖Q) = \frac{1}{\alpha - 1}\log\int(p/q)^{\alpha} q\,d\mu. We prove the result assuming that α ∈ (1, ∞), as continuity
gives the result for α ∈ {1, ∞}.
First, it is clear that D_α(P‖Q) ≤ ε always. For the other term in the minimum, let us assume
that α ≤ 1 + 1/ε and ε ≤ 1. If either of these fails, the result is trivial, because for α > 1 + 1/ε we
have \frac{3}{2}\alpha\varepsilon^2 \ge \frac{3}{2}\varepsilon \ge \varepsilon, and similarly ε ≥ 1 implies \frac{3}{2}\alpha\varepsilon^2 \ge \varepsilon.
Now we perform a Taylor approximation of t 7→ (1 + t)α . By Taylor’s theorem, we have for any
t > −1 that
\[
(1 + t)^{\alpha} = 1 + \alpha t + \frac{\alpha(\alpha - 1)}{2}(1 + \tilde{t})^{\alpha - 2} t^2
\]
for some \tilde{t} between 0 and t. Consequently,
\[
\begin{aligned}
\exp\big((\alpha - 1)D_\alpha(P\|Q)\big)
&= \int\Big(\frac{p(z)}{q(z)}\Big)^{\alpha} q(z)\,d\mu(z)
= \int\Big(1 + \Big(\frac{p(z)}{q(z)} - 1\Big)\Big)^{\alpha} q(z)\,d\mu(z) \\
&\le 1 + \alpha\int\Big(\frac{p(z)}{q(z)} - 1\Big)q(z)\,d\mu(z)
+ \frac{\alpha(\alpha - 1)}{2}\max\{1, \exp(\varepsilon(\alpha - 2))\}\int\Big(\frac{p(z)}{q(z)} - 1\Big)^2 q(z)\,d\mu(z) \\
&\le 1 + \frac{\alpha(\alpha - 1)}{2} e^{\varepsilon[\alpha - 2]_+}\cdot(e^{\varepsilon} - 1)^2.
\end{aligned}
\]
Now, we know that α − 2 ≤ 1/ε − 1 by assumption, so using that log(1 + x) ≤ x, we obtain
\[
D_\alpha(P\|Q) \le \frac{\alpha}{2}(e^{\varepsilon} - 1)^2\cdot\exp([1 - \varepsilon]_+).
\]
Finally, a numerical calculation yields that this quantity is at most \frac{3\alpha}{2}\varepsilon^2 for ε ≤ 1.
We can also provide connections from (ε, α)-Rényi privacy to (ε, δ)-differential privacy, and
then from there to ε-differential privacy. We begin by showing how to develop (ε, δ)-differential
privacy out of Rényi privacy. Another way to think about this proposition is that whenever two
distributions P and Q are close in Rényi divergence, then there is some limited “amplification” of
probabilities that is possible in moving from one to the other.
Proposition 7.2.7. Let P and Q satisfy Dα (P ||Q) ≤ ε. Then for any set A,
\[
P(A) \le \exp\Big(\frac{\alpha - 1}{\alpha}\varepsilon\Big)\, Q(A)^{\frac{\alpha - 1}{\alpha}}.
\]
Before turning to the proof of the proposition, we show how it can provide prototypical (ε, δ)-
private mechanisms via Gaussian noise addition.
\[
E[\|Z_{\rm Gauss} - E[X]\|_2^2] = E\big[\|\bar{X}_n - E[X]\|_2^2\big] + O(1)\,\frac{r^2}{n^2\varepsilon^2}\cdot d\log\frac{1}{\delta}.
\]
Comparing to the previous cases, we see an improvement over the Laplace mechanism whenever
log(1/δ) ≲ d, or δ ≳ e^{−d}. ◇
Proof of Proposition 7.2.7   We use the data processing inequality of Proposition 7.2.1(iii),
which shows that
\[
\varepsilon \ge D_\alpha(P\|Q) \ge \frac{1}{\alpha - 1}\log\Big[\Big(\frac{P(A)}{Q(A)}\Big)^{\alpha} Q(A)\Big].
\]
Rearranging and taking exponentials, we immediately obtain the first claim of the proposition.
For the second, we require a bit more work. First, let us assume that Q(A) > e^{-\varepsilon}\delta^{\frac{\alpha}{\alpha - 1}}. Then
we have by the first claim of the proposition that
\[
\begin{aligned}
P(A) &\le \exp\Big(\frac{\alpha - 1}{\alpha}\varepsilon + \frac{1}{\alpha}\log\frac{1}{Q(A)}\Big) Q(A) \\
&\le \exp\Big(\frac{\alpha - 1}{\alpha}\varepsilon + \frac{1}{\alpha}\Big(\varepsilon + \frac{\alpha}{\alpha - 1}\log\frac{1}{\delta}\Big)\Big) Q(A)
= \exp\Big(\varepsilon + \frac{1}{\alpha - 1}\log\frac{1}{\delta}\Big) Q(A).
\end{aligned}
\]
On the other hand, when Q(A) \le e^{-\varepsilon}\delta^{\frac{\alpha}{\alpha - 1}}, then again using the first result of the proposition,
\[
P(A) \le \exp\Big(\frac{\alpha - 1}{\alpha}\big(\varepsilon + \log Q(A)\big)\Big)
\le \exp\Big(\frac{\alpha - 1}{\alpha}\Big(\varepsilon - \varepsilon + \frac{\alpha}{\alpha - 1}\log\delta\Big)\Big) = \delta.
\]
Finally, we develop our last set of connections, which show how we may relate (ε, δ)-private
channels with ε-private channels. To provide this definition, we require one additional weakened
notion of divergence, which relates (ε, δ)-differential privacy to Rényi-α-divergence with α = ∞.
We define
\[
D_\infty^{\delta}(P\|Q) := \sup_{S \subset \mathcal{X}}\Big\{\log\frac{P(S) - \delta}{Q(S)} \;\Big|\; P(S) > \delta\Big\},
\]
where the supremum is over measurable sets. Evidently equivalent to this definition is that
D_\infty^{\delta}(P\|Q) \le \varepsilon if and only if
\[
P(S) \le e^{\varepsilon} Q(S) + \delta \quad\text{for all } S \subset \mathcal{X}.
\]
Then we have the following lemma.
Lemma 7.2.10. Let ε > 0 and δ ∈ (0, 1), and let P and Q be distributions on a space X .
(i) We have D_\infty^{\delta}(P\|Q) \le \varepsilon if and only if there exists a probability distribution R on X such that
\|P - R\|_{\rm TV} \le \delta and D_\infty(R\|Q) \le \varepsilon.
(ii) We have D_\infty^{\delta}(P\|Q) \le \varepsilon and D_\infty^{\delta}(Q\|P) \le \varepsilon if and