EE/Stats 376A: Information theory Winter 2017
Lecture 15 — March 2
Lecturer: David Tse Scribe: Vihan Lakshman, Burak B, Aykan O, Tung-Yu W, David Z
15.1 Outline
• Differential Entropy, Divergence, and Mutual Information
• Entropy Maximization
• Capacity of Gaussian Channels
15.2 Recap - Differential Entropy
Last lecture, we introduced a notion of entropy in the setting of continuous random variables
known as differential entropy. For a continuous random variable Y , the differential entropy
of Y is defined as
h(Y) \triangleq E\left[\log \frac{1}{f(Y)}\right]
and, for another continuous random variable X, we defined the conditional differential en-
tropy as
h(Y|X) \triangleq E\left[\log \frac{1}{f(Y|X)}\right].
Recall that, unlike their discrete counterparts, h(Y) and h(Y|X) carry units: they change under a rescaling of the variable (we will use the identity h(aX) = h(X) + \log|a| from last lecture below). Let's now check other interesting properties.
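Before doing so, here is a minimal numerical sketch of the units remark above (Python, with illustrative numbers, not part of the original notes): for a Uniform(0, a) random variable, h = log a, so expressing the very same length in centimeters rather than meters shifts the differential entropy by log 100, whereas a discrete entropy would be unaffected by such relabeling.

    import numpy as np

    def h_uniform_bits(a):
        """Differential entropy (in bits) of Uniform(0, a): h = log2(a)."""
        return np.log2(a)

    # The same physical length, expressed in two different units.
    h_meters = h_uniform_bits(1.0)         # uniform on [0, 1] meter
    h_centimeters = h_uniform_bits(100.0)  # the identical interval in centimeters

    print(h_meters)                        # 0.0 bits
    print(h_centimeters)                   # ~6.64 bits
    print(h_centimeters - h_meters)        # log2(100): the shift due to rescaling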
15.2.1 Mutual Information: Label Invariance
Theorem 1. For a nonzero constant a, I(aX; Y) = I(X; Y).
Proof. Recall from last time,
I(aX; Y ) = h(aX) − h(aX|Y ).
Furthermore, we showed last time that h(aX) = h(X) + \log|a|. Therefore,
I(aX; Y ) = h(X) + log |a| − h(aX|Y ).
To proceed from here, we need to get a handle on the term h(aX|Y ). It will help to reason
by analogy and remember that in the discrete case we have
H(X|Y) = \sum_y H(X|Y = y)\, p(y).
For continuous random variables, recall that sums are replaced by integrals and probabilities
by density functions. Making these modifications, we arrive at the following definition for
conditional differential entropy:
h(X|Y) = \int_y h(X|Y = y)\, f(y)\, dy.
By the same reasoning through which we showed h(aX) = h(X) + log |a| (see the previous
lecture) we can conclude that h(aX|Y = y) = h(X|Y = y) + log |a|. Therefore,
\int_y h(aX|Y = y)\, f(y)\, dy = \int_y \bigl(h(X|Y = y) + \log|a|\bigr) f(y)\, dy = h(X|Y) + \log|a|,

that is, h(aX|Y) = h(X|Y) + \log|a|.
Plugging these quantities into the definition of mutual information we have
I(aX; Y ) = h(aX) − h(aX|Y )
= h(X) + log |a| − h(X|Y ) − log |a|
= h(X) − h(X|Y ) = I(X; Y ).
As an exercise, the reader is encouraged to consider whether the above result holds for
any one-to-one function on X or Y . Why would such a result be important? Remember
that the capacity of a channel can be described in terms of mutual information. Thus,
the aforementioned result would imply that the capacity remains the same if we perform
lossless pre- or post-processing on the data we communicate.
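As a quick numerical check of Theorem 1 in a special case (a Monte Carlo sketch, not part of the proof): for jointly Gaussian X and Y with correlation ρ, the mutual information has the closed form I(X;Y) = −½ log(1 − ρ²), and scaling X by any nonzero a leaves |ρ| unchanged. The sample size and the constants rho and a below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n, rho = 200_000, 0.6

    # Jointly Gaussian (X, Y) with correlation rho.
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

    def gaussian_mi_bits(u, v):
        """I(U;V) = -1/2 log2(1 - corr^2), exact for jointly Gaussian pairs."""
        r = np.corrcoef(u, v)[0, 1]
        return -0.5 * np.log2(1 - r**2)

    a = -3.7  # any nonzero constant
    print(gaussian_mi_bits(x, y))      # ~0.32 bits
    print(gaussian_mi_bits(a * x, y))  # the same value: scaling X changes nothing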
15.3 Properties
15.3.1 Chain Rule for Differential Entropy
Let’s continue our exploration of differential entropy and ask another very natural question:
Does the chain rule hold? That is, does h(X1 , X2 ) = h(X1 ) + h(X2 |X1 )? As it turns out,
this result is true and the proof is nearly identical to our derivation of the original chain rule
for discrete random variables:
h(X_1, X_2) = E\left[\log \frac{1}{f(X_1, X_2)}\right]
            = E\left[\log \frac{1}{f(X_1)\, f(X_2|X_1)}\right]
            = E\left[\log \frac{1}{f(X_1)}\right] + E\left[\log \frac{1}{f(X_2|X_1)}\right]
            = h(X_1) + h(X_2|X_1).
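The chain rule can be verified numerically in the Gaussian case, where every term has a closed form. The sketch below (illustrative parameter values, not part of the notes) compares h(X_1, X_2) for a bivariate Gaussian with h(X_1) + h(X_2|X_1), using Var(X_2|X_1) = σ_2²(1 − ρ²).

    import numpy as np

    def h_gauss_bits(var):
        """Differential entropy (bits) of a Gaussian with the given variance."""
        return 0.5 * np.log2(2 * np.pi * np.e * var)

    s1, s2, rho = 2.0, 0.5, 0.8              # illustrative standard deviations and correlation
    cov = np.array([[s1**2,      rho*s1*s2],
                    [rho*s1*s2,  s2**2    ]])

    # Joint entropy of a bivariate Gaussian: 1/2 log2((2*pi*e)^2 det(cov)).
    h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(cov))

    # Chain rule pieces: h(X1) plus h(X2|X1), with Var(X2|X1) = s2^2 (1 - rho^2).
    h_x1 = h_gauss_bits(s1**2)
    h_x2_given_x1 = h_gauss_bits(s2**2 * (1 - rho**2))

    print(h_joint, h_x1 + h_x2_given_x1)     # the two quantities coincide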
15.3.2 Differential Entropy and Divergence
What about divergence? As a refresher, we defined the divergence between two discrete
distributions p and q as
D(p\|q) \triangleq E\left[\log \frac{p(X)}{q(X)}\right] \quad \text{for } X \sim p.
A very natural idea is to define divergence the same way in the continuous case, substituting
density functions into the appropriate places:
D(f\|g) \triangleq E\left[\log \frac{f(X)}{g(X)}\right] \quad \text{for } X \sim f.
One of the most important properties of relative entropy in the discrete case is non-negativity,
which we leveraged a number of times in proving other bounds. Thus, we would like to know
if the same result holds for relative differential entropy, and the answer turns out to be
affirmative.
Theorem 2 (Cover & Thomas, page 252).
D(f ||g) ≥ 0.
Proof. The proof will be very similar to the discrete version we saw in an earlier lecture. Let
f and g be two probability density functions and let S denote the support of f . Then,
D(f\|g) = E\left[-\log \frac{g(X)}{f(X)}\right]
\geq -\log E\left[\frac{g(X)}{f(X)}\right] \qquad \text{(Jensen's inequality, since } -\log \text{ is convex)}
= -\log \int_S f(x)\, \frac{g(x)}{f(x)}\, dx
= -\log \int_S g(x)\, dx
\geq -\log 1 = 0,

where the last inequality holds because \int_S g(x)\, dx \leq 1.
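To illustrate Theorem 2 numerically (a sketch with arbitrarily chosen Gaussians, using SciPy, not part of the original proof), one can estimate D(f||g) by Monte Carlo and compare it with the known closed form for a pair of Gaussian densities; both values are nonnegative, as the theorem guarantees.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n = 500_000

    # Two densities: f = N(0, 1) and g = N(1, 2^2).
    f = norm(loc=0.0, scale=1.0)
    g = norm(loc=1.0, scale=2.0)

    # Monte Carlo estimate of D(f||g) = E_f[log f(X)/g(X)]  (in nats).
    x = f.rvs(size=n, random_state=rng)
    d_mc = np.mean(f.logpdf(x) - g.logpdf(x))

    # Closed form for two Gaussians, as a cross-check:
    # log(s_g/s_f) + (s_f^2 + (m_f - m_g)^2) / (2 s_g^2) - 1/2.
    d_exact = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

    print(d_mc, d_exact)   # both roughly 0.443, and nonnegative as promised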
Corollary 1. For continuous random variables X and Y , I(X; Y ) ≥ 0.
Proof. Recall that we can write mutual information as a divergence between a joint density
function and the product of the marginal densities. Therefore,
I(X;Y) = D\bigl(f_{X,Y}(x, y) \,\|\, f_X(x) \cdot f_Y(y)\bigr) \geq 0

by Theorem 2.
Corollary 2. For continuous random variables X and Y, h(X|Y) \leq h(X).
Proof. Since I(X;Y) = h(X) - h(X|Y) \geq 0 by Corollary 1, we have h(X|Y) \leq h(X). Note that h(X|Y) = h(X) if and only if X and Y are independent.
15.3.3 Concavity of Differential Entropy
Theorem 3. h(X) is concave in the distribution of X.
Proof. Let X_1 and X_2 be continuous random variables with densities f_1 and f_2. Now, let's define an auxiliary random variable Y \sim \mathrm{Bern}(\lambda), independent of (X_1, X_2), and let

X = \begin{cases} X_1 & \text{if } Y = 1 \\ X_2 & \text{if } Y = 0, \end{cases}

so that X has the mixture density \lambda f_1 + (1 - \lambda) f_2. Thus, by Corollary 2, we have

h(X) \geq h(X|Y)
     = h(X|Y = 1)\, P(Y = 1) + h(X|Y = 0)\, P(Y = 0)
     = \lambda h(X_1) + (1 - \lambda) h(X_2).
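Concavity can also be checked numerically. The sketch below (illustrative components and mixing weight, using SciPy for quadrature; not part of the notes) computes the differential entropy of a two-component Gaussian mixture by numerical integration and compares it with the convex combination of the component entropies; the mixture entropy is the larger quantity, as the theorem asserts.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    lam = 0.3
    f1, f2 = norm(-2.0, 1.0), norm(3.0, 2.0)     # two component densities

    def mix_pdf(x):
        return lam * f1.pdf(x) + (1 - lam) * f2.pdf(x)

    # h(mixture) = -integral of p(x) log p(x) dx, by numerical quadrature (nats).
    h_mix, _ = quad(lambda x: -mix_pdf(x) * np.log(mix_pdf(x)), -30, 30, limit=200)

    # Closed-form component entropies: h(N(m, s^2)) = 1/2 ln(2*pi*e*s^2).
    h1 = 0.5 * np.log(2 * np.pi * np.e * 1.0**2)
    h2 = 0.5 * np.log(2 * np.pi * np.e * 2.0**2)

    print(h_mix, lam * h1 + (1 - lam) * h2)      # h_mix is the larger of the two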
15.4 Entropy Maximization: Discrete Case
We now transition to entropy maximization. In this section, we review a result that we have
seen before regarding the discrete distribution that maximizes entropy. Let X be a discrete
random variable on the support set {1, 2, . . . , k}. Amongst all discrete distributions p of X
on {1, 2, . . . , k}, which one has maximum entropy?
As we’ve seen, the answer is the discrete uniform distribution U (1, 2, . . . , k). In a previous
homework, we showed this result using concavity and label invariance. In this section, we
will present another argument using the definition of entropy directly – a line of reasoning
we will use again later in this lecture when finding the entropy maximizing distribution in
the continuous case.
Theorem 4. For a given finite alphabet, the uniform distribution achieves the maximum entropy.
Proof. Let U denote the discrete uniform distribution on {1, 2, . . . , k} as defined above and
let p denote an arbitrary discrete distribution on X. By the non-negativity of relative
entropy, we note that
D(p||U ) ≥ 0.
Thus,
D(p\|U) = \sum_x p(x) \log \frac{p(x)}{U(x)} = -H(X) - \sum_x p(x) \log U(x).
Now, note that U(x) = \frac{1}{k} for all x \in \{1, 2, \ldots, k\}. Thus, \log U(x) is in fact a constant,
so we can pull it out of the above summation and write
D(p\|U) = -H(X) - \log U(x) \sum_x p(x).
Since p is a probability distribution, we observe that \sum_x p(x) = 1. Therefore, we can substitute
any other expression equal to 1 in place of \sum_x p(x). The key insight is to substitute in \sum_x U(x),
which is also equal to 1 since U is also a probability distribution. Applying this substitution
gives us

D(p\|U) = -H(X) - \log U(x) \sum_x U(x) = -H(X) - \sum_x U(x) \log U(x) = -H(X) + H(U)
where we define H(U ) to be the entropy of a uniformly distributed random variable. Up to
this point, we have shown that
D(p||U ) = −H(X) + H(U )
and using the fact that D(p||U ) ≥ 0 we can conclude that
H(U ) ≥ H(X).
Since our choice of a probability distribution on X was arbitrary, this completes the proof.
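A quick empirical illustration of Theorem 4 (a small Python sketch; the alphabet size and the randomly drawn distributions are arbitrary): the entropy of the uniform distribution on k symbols is log k bits, and every other distribution falls at or below it.

    import numpy as np

    rng = np.random.default_rng(0)
    k = 8

    def entropy_bits(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Entropy of the uniform distribution on {1, ..., k} is log2(k).
    h_uniform = entropy_bits(np.full(k, 1.0 / k))

    # Entropies of a few randomly drawn distributions on the same alphabet.
    h_random = [entropy_bits(p) for p in rng.dirichlet(np.ones(k), size=5)]

    print(h_uniform)   # log2(8) = 3 bits
    print(h_random)    # every value is at most 3 bits, as Theorem 4 states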
15.5 Entropy Maximization: Continuous Case
Let’s now ask the same question for the continuous setting. That is, we would like to solve
the optimization problem
\max_f h(X)
for a continuous random variable X. After a bit of thought, one realizes that this problem
is not very well formulated since the differential entropy can become arbitrarily large. In
particular, if we take the density function f to be uniform on longer and longer intervals,
the differential entropy will approach ∞. To remedy this issue, we introduce an additional
constraint. Intuitively, the problem we ran into with our first attempt at formulating the
optimization problem is that we could keep “spreading out” a uniform distribution over
longer and longer intervals, which thereby increases the differential entropy. We observe that
as we stretch a uniform distribution over longer intervals, the variance of the distribution
increases as well. Thus, if we introduce a constraint that could control the variance of the
density function, then we might be able to circumvent the problem. To that end, we will
introduce a second moment constraint. Since the second moment E[X 2 ] is intimately tied to
variance, this will help us rectify our earlier issue. Our new optimization problem is now
\max_f h(X) \quad \text{subject to} \quad E[X^2] = \alpha
for some constant α. Now, we are ready to state and prove the main result of this section.
Theorem 5. The Gaussian distribution achieves maximum differential entropy subject to
the second moment constraint.
Proof. We'll follow a similar outline to our proof that the uniform distribution achieves
maximum entropy in the discrete case. As we did previously, let's start with divergence. Let
φ(x) denote the Gaussian density with mean 0 and variance α, where α is the value fixed by
our second moment constraint; that is,

\phi(x) = \frac{1}{\sqrt{2\pi\alpha}}\, e^{-x^2/(2\alpha)}.

Let f denote an arbitrary density function on X satisfying E[X^2] = \alpha. The divergence is then
D(f\|\phi) = E\left[\log \frac{f(X)}{\phi(X)}\right], \quad X \sim f.
Using the definition of differential entropy this becomes
D(f\|\phi) = -h(X) + E\left[\log \frac{1}{\phi(X)}\right] = -h(X) + E\left[\log\left(\sqrt{2\pi\alpha}\, e^{X^2/(2\alpha)}\right)\right].
Simplifying the second term of the above sum gives us
D(f\|\phi) = -h(X) + \log\sqrt{2\pi\alpha} + \frac{\log e}{2\alpha}\, E\left[X^2\right].
Observe that the expectation E[X^2] appearing above is pinned down by our second moment
constraint: it equals α for every feasible f. Using the same trick we employed in the discrete
case proof, we can therefore change the distribution without changing this term, as long as
we preserve the second moment. Thus, we will replace f by the Gaussian density with the
same second moment. To make this substitution explicit, let X_G denote a Gaussian random
variable with mean zero and variance α, whose density is exactly φ. With this substitution,
observe that
E\left[\log \frac{1}{\phi(X)}\right] = E\left[\log \frac{1}{\phi(X_G)}\right] = h(X_G).
Therefore, we see that

D(f\|\phi) = -h(X) + h(X_G) \geq 0

by the non-negativity of divergence, and hence

h(X_G) \geq h(X).
Remark: Also note that the Gaussian distribution is essentially the unique differential en-
tropy optimizer. Slightly more formally, if f maximizes the differential entropy subject to the
second moment constraint, then f is equal to the Gaussian distribution almost everywhere.
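To see Theorem 5 in action numerically (a sketch using standard closed forms; the value of alpha is arbitrary and the snippet is not part of the notes), one can compare three zero-mean distributions with the same second moment α: the Gaussian, the Laplace, and the uniform. The Gaussian has the largest differential entropy.

    import numpy as np

    alpha = 1.0   # common second moment E[X^2]; all three distributions are zero mean

    # Gaussian N(0, alpha): h = 1/2 ln(2*pi*e*alpha)   (nats).
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * alpha)

    # Laplace with variance alpha has scale b with 2 b^2 = alpha; h = 1 + ln(2b).
    b = np.sqrt(alpha / 2)
    h_laplace = 1 + np.log(2 * b)

    # Uniform on [-c, c] with variance alpha has c = sqrt(3*alpha); h = ln(2c).
    c = np.sqrt(3 * alpha)
    h_uniform = np.log(2 * c)

    print(h_gauss, h_laplace, h_uniform)   # ~1.419 > ~1.347 > ~1.242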
15.6 The Gaussian Channel Revisited
With our understanding of maximum entropy distributions, let’s turn our attention back to
the communication problem and conclude this lecture by proving the result stated during
the previous class that the capacity of a Gaussian channel is C = \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right).
Recall the setup of the Gaussian channel. For each input X_i, the channel outputs Y_i = X_i + Z_i,
where the noise Z_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) is independent of X. Furthermore, we impose a power
constraint for block length n:

\frac{1}{n}\, E\left[\sum_{i=1}^{n} X_i^2\right] \leq P.

The key insight we now make is that this power constraint is nothing more than the second moment constraint that we worked with
in the previous section. If we formulate the channel capacity as an optimization problem as
we have done before, we have
\max_{f_X} I(X; Y) \quad \text{subject to} \quad E[X^2] = P.
Expanding the definition of mutual information, we have
I(X; Y ) = h(Y ) − h(Y |X) = h(Y ) − h(Z)
where the last equality holds because h(Y|X) = h(X + Z|X) = h(Z|X) = h(Z), using the independence of Z and X. Thus, we can
rewrite our optimization problem as
\max_{f_X} h(Y) \quad \text{subject to} \quad E[X^2] = P,
which we can rewrite as
\max_{f_X} h(X + Z) \quad \text{subject to} \quad E[(X + Z)^2] = P + \sigma^2,

where we used E[(X + Z)^2] = E[X^2] + E[Z^2] = P + \sigma^2, since Z has zero mean and is independent of X.
From the result of the previous section, we know that the maximum entropy is achieved when
X + Z is normally distributed. Since the sum of two independent Gaussians is Gaussian, we can
achieve this by setting X \sim N(0, P) (recall that Z is Gaussian by assumption). As before,
let X_G denote the random variable X under this Gaussian distribution. Applying this result
to the mutual information between X and Y, we have
to the mutual information between X and Y we have
I(X_G; Y) = h(X_G + Z) - h(Z) = \frac{1}{2}\log\bigl(2\pi e\,(P + \sigma^2)\bigr) - \frac{1}{2}\log\bigl(2\pi e\,\sigma^2\bigr)
which simplifies to
\frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right).
Therefore, the capacity of the Gaussian channel is
C = \max_{f_X:\, E[X^2] \leq P} I(X; Y) = \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right)
as we stated last lecture.
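For reference, the capacity formula is easy to evaluate numerically. The following minimal sketch (arbitrarily chosen signal-to-noise ratios; the helper name is illustrative, not standard) prints C for a few values of P/σ².

    import numpy as np

    def gaussian_capacity_bits(P, sigma2):
        """C = 1/2 log2(1 + P / sigma^2), in bits per channel use."""
        return 0.5 * np.log2(1 + P / sigma2)

    # Capacity at a few signal-to-noise ratios P / sigma^2.
    for P, sigma2 in [(1.0, 1.0), (10.0, 1.0), (100.0, 1.0)]:
        print(P / sigma2, gaussian_capacity_bits(P, sigma2))
    # SNR 1 -> 0.5 bits/use, SNR 10 -> ~1.73, SNR 100 -> ~3.33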